Abstract

We consider the framework of stochastic multi-armed bandit problems and study
the possibilities and limitations of forecasters that perform an on-line
exploration of the arms. These forecasters are assessed in terms of their simple
regret, a regret notion that captures the fact that exploration is only
constrained by the number of available rounds (not necessarily known in
advance), in contrast to the case when the cumulative regret is considered and
when exploitation needs to be performed at the same time. We believe that this
performance criterion is suited to situations when the cost of pulling an arm is
expressed in terms of resources rather than rewards. We discuss the links
between the simple and the cumulative regret. One of the main results in the
case of a finite number of arms is a general lower bound on the simple regret of
a forecaster in terms of its cumulative regret: the smaller the latter, the
larger the former. Keeping this result in mind, we then exhibit upper bounds on
the simple regret of some forecasters. The paper ends with a study devoted to
continuous-armed bandit problems; we show that the simple regret can be
minimized with respect to a family of probability distributions if and only if
the cumulative regret can be minimized for it. Based on this equivalence, we are
able to prove that the separable metric spaces are exactly the metric spaces on
which these regrets can be minimized with respect to the family of all
probability distributions with continuous mean-payoff functions.

Keywords: Multi-armed bandits, Continuous-armed bandits, Simple regret, 
Efficient exploration 



1. Introduction 

Learning processes usually face an exploration versus exploitation dilemma, 
since they have to get information on the environment (exploration) to be able 
to take good actions (exploitation). A key example is the multi-armed bandit
problem [19], a sequential decision problem where, at each stage, the forecaster
has to pull one out of K given stochastic arms and gets a reward drawn at
random according to the distribution of the chosen arm. The usual assessment
criterion of a forecaster is given by its cumulative regret, the sum of differences
between the expected reward of the best arm and the obtained rewards. Typical
good forecasters, like UCB, trade off between exploration and exploitation.

Our setting is as follows. The forecaster may sample the arms a given number 
of times n (not necessarily known in advance) and is then asked to output a 
recommended arm. He is evaluated by his simple regret, that is, the difference 
between the average payoff of the best arm and the average payoff obtained 
by his recommendation. The distinguishing feature from the classical multi- 
armed bandit problem is that the exploration phase and the evaluation phase 
are separated. We now illustrate why this is a natural framework for numerous 
applications. 

Historically, the first occurrence of multi-armed bandit problems was given 
by medical trials. In the case of a severe disease, ill patients only are included 
in the trial and the cost of picking the wrong treatment is high (the associated 
reward would equal a large negative value). It is important to minimize the 
cumulative regret, since the test and cure phases coincide. However, for cosmetic 
products, there exists a test phase separated from the commercialization phase, 
and one aims at minimizing the regret of the commercialized product rather 
than the cumulative regret in the test phase, which is irrelevant. (Here, several
formulae for a cream are considered and some quantitative measurement, like
skin moisturization, is performed.)

The pure exploration problem addresses the design of strategies making the
best possible use of available numerical resources (e.g., CPU time) in order to
optimize the performance of some decision-making task. That is, it occurs in
situations with a preliminary exploration phase in which costs are not measured
in terms of rewards but rather in terms of resources, which come in a limited budget.

A motivating example concerns recent works on computer Go (e.g., the MoGo
program [10]). A given time, i.e., a given amount of CPU time, is given to the
player to explore the possible outcomes of sequences of plays and output a final
decision. An efficient exploration of the search space is obtained by considering
a hierarchy of forecasters minimizing some cumulative regret; see, for instance,
the UCT strategy [14] and the BAST strategy Q. However, the cumulative regret
does not seem to be the right criterion on which to base the strategies, since the
simulation costs are the same for exploring all options, bad and good ones. This
observation was actually the starting point of the notion of simple regret and of
this work.

A final related example is the maximization of some function f, observed
with noise, see, e.g., 12, (jj. Whenever evaluating f at a point is costly (e.g.,
in terms of numerical or financial costs), the issue is to choose as adequately
as possible where to query the value of this function in order to have a good
approximation to the maximum. The pure exploration problem considered here
addresses exactly the design of adaptive exploration strategies making the best
use of available resources in order to make the most precise prediction once all
resources are consumed.

As a remark, it also turns out that in all examples considered above, we
may impose the further restriction that the forecaster does not know ahead of
time the amount of available resources (time, budget, or the number of patients
to be included); that is, we seek anytime performance.

The problem of pure exploration presented above was referred to as "budgeted
multi-armed bandit problem" in the open problem [16] (where, however,
another notion of regret than simple regret is considered). The pure exploration
problem was solved in a minimax sense for the case of two arms only and
rewards given by probability distributions over [0, 1] in [20]. A related setting is
considered in Q and [ljj, where forecasters perform exploration during a random
number of rounds $T$ and aim at identifying an $\varepsilon$-best arm. These articles
study the possibilities and limitations of policies achieving this goal with
overwhelming $1 - \delta$ probability and indicate in particular upper and lower bounds
on (the expectation of) $T$. Another related problem is the identification of the
best arm (with high probability). However, this binary assessment criterion (the
forecaster is either right or wrong in recommending an arm) does not capture
the possible closeness in performance of the recommended arm compared to
the optimal one, which the simple regret does. Moreover, unlike the latter, this
criterion is not suited for a distribution-free analysis.



Contents and structure of the paper 

We present formally the model in Section 2 and indicate therein that our
aim is to study the links between the simple and the cumulative regret.
Intuitively, an efficient allocation strategy for the simple regret should rely on
some exploration-exploitation trade-off but the rest of the paper shows that
this trade-off is not exactly the same as in the case of the cumulative regret.

Our first main contribution (Theorem 1, Section 3) is a lower bound on
the simple regret in terms of the cumulative regret suffered in the exploration
phase, which shows that the minimal simple regret is larger as the bound on
the cumulative regret is smaller. This in particular implies that the uniform
exploration of the arms is a good benchmark when the number of exploration
rounds n is large.

In Section 4 we then study the simple regret of some natural forecasters,
including the one based on uniform exploration, whose simple regret vanishes
exponentially fast. (Note: The upper bounds presented in this paper can however
be improved by the recent results of Q.) In Section 5 we show how one
can somewhat circumvent the fundamental lower bound indicated above: some
strategies designed to have a small cumulative regret can outperform (for small
or moderate values of n) strategies with exponential rates of convergence for
their simple regret; this is shown both by means of a theoretical study and by
simulations.

Finally we investigate in Section 6 the continuous-armed bandit problem
where the set of arms is a topological space. In this setting we use the simple
regret as a tool to prove that the separable metric spaces are exactly the metric
spaces for which it is possible to have a sublinear cumulative regret with
respect to the family of all probability distributions with continuous mean-payoff
functions. This is our second main contribution.

2. Problem setup, notation, structure of the paper 

We consider a sequential decision problem given by stochastic multi-armed
bandits. A finite number $K \geq 2$ of arms, denoted by $i = 1, \ldots, K$, are available
and the $i$-th of them is parameterized by a fixed (unknown) probability
distribution $\nu_i$ over [0, 1], with expectation denoted by $\mu_i$. At those rounds when it is
pulled, its associated reward is drawn at random according to $\nu_i$, independently
of all previous rewards. For each arm $i$ and all time rounds $n \geq 1$, we denote
by $T_i(n)$ the number of times arm $i$ was pulled from rounds 1 to $n$, and by
$X_{i,1}, X_{i,2}, \ldots, X_{i,T_i(n)}$ the sequence of associated rewards.

The forecaster has to deal simultaneously with two tasks, a primary one and
a secondary one. The secondary task consists in exploration, i.e., the forecaster
should indicate at each round $t$ the arm $I_t$ to be pulled, based on past rewards
(so that $I_t$ is a random variable). Then the forecaster gets to see the associated
reward $Y_t$, also denoted by $X_{I_t, T_{I_t}(t)}$ with the notation above. The sequence of
random variables $(I_t)$ is referred to as an allocation strategy. The primary task
is to output at the end of each round $t$ a recommendation $J_t$ to be used in a
one-shot instance if/when the environment sends some stopping signal meaning
that the exploration phase is over. The sequence of random variables $(J_t)$ is
referred to as a recommendation strategy. In total, a forecaster is given by an
allocation and a recommendation strategy.

Figure 1 summarizes the description of the sequential game and points out
that the information available to the forecaster for choosing $I_t$, respectively $J_t$,
is formed by the $X_{i,s}$ for $i = 1, \ldots, K$ and $s = 1, \ldots, T_i(t-1)$, respectively,
$s = 1, \ldots, T_i(t)$. Note that we also allow the forecaster to use an external
randomization in the definition of $I_t$ and $J_t$.
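The protocol just described (and summarized in Figure 1) can be sketched as a short simulation loop. This is only an illustrative sketch under assumptions that are not in the text: arms are indexed from 0, rewards are Bernoulli, the stopping signal is sent after a fixed number of rounds, and the names `play_pure_exploration`, `allocate` and `recommend` are hypothetical.

```python
import random

def play_pure_exploration(arm_means, allocate, recommend, n_rounds, seed=0):
    """Run the sequential game for n_rounds rounds with Bernoulli arms.

    allocate(history, K)  -> index of the arm to pull (I_t), from past rewards only.
    recommend(history, K) -> recommended arm (J_t), from rewards up to round t.
    history[i] is the list of rewards X_{i,1}, ..., X_{i,T_i(t)} of arm i.
    """
    rng = random.Random(seed)
    K = len(arm_means)
    history = [[] for _ in range(K)]
    recommendations = []
    for _ in range(n_rounds):
        i_t = allocate(history, K)                              # step (1)
        y_t = 1.0 if rng.random() < arm_means[i_t] else 0.0     # step (2)
        history[i_t].append(y_t)
        recommendations.append(recommend(history, K))           # step (3)
    # step (4): in this sketch the stopping signal arrives after n_rounds rounds
    return history, recommendations

# Example: round-robin allocation, recommendation of the most played arm.
round_robin = lambda history, K: sum(len(r) for r in history) % K
most_played = lambda history, K: max(range(K), key=lambda i: len(history[i]))
history, recs = play_pure_exploration([0.2, 0.8], round_robin, most_played, 10)
```

Keeping the allocation and the recommendation as two separate callables mirrors the split between the secondary (exploration) and primary (recommendation) tasks.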

Parameters: $K$ probability distributions for the rewards of the arms, $\nu_1, \ldots, \nu_K$.
For each round $t = 1, 2, \ldots$,

(1) the forecaster chooses $I_t \in \{1, \ldots, K\}$;

(2) the environment draws the reward $Y_t$ for that action (also denoted by
$X_{I_t, T_{I_t}(t)}$ with the notation introduced in the text);

(3) the forecaster outputs a recommendation $J_t \in \{1, \ldots, K\}$;

(4) if the environment sends a stopping signal, then the game ends;
otherwise, the next round starts.

Figure 1: The pure exploration problem for multi-armed bandits (with a finite number of
arms).

As we are only interested in the performances of the recommendation strategy
$(J_t)$, we call this problem the pure exploration problem for multi-armed
bandits and evaluate the forecaster through its simple regret, defined as follows.
First, we denote by 

\[ \mu^* = \mu_{i^*} = \max_{i=1,\ldots,K} \mu_i \]

the expectation of the rewards of the best arm $i^*$ (a best arm, if there are several
of them with same maximal expectation). A useful notation in the sequel is the
gap $\Delta_i = \mu^* - \mu_i$ between the maximal expected reward and the one of the $i$-th
arm; as well as the minimal gap

\[ \Delta = \min_{i : \Delta_i > 0} \Delta_i . \]

Now, the simple regret at round $n$ equals the regret on a one-shot instance of
the game for the recommended arm $J_n$, that is, put more formally,

\[ r_n = \mu^* - \mu_{J_n} = \Delta_{J_n} . \]

A quantity of related interest is the cumulative regret at round $n$, which is
defined as

\[ R_n = \sum_{t=1}^{n} \bigl( \mu^* - \mu_{I_t} \bigr) . \]

A popular treatment of the multi-armed bandit problems is to construct
forecasters ensuring that $\mathbb{E} R_n = o(n)$, see, e.g., or Q, and even $R_n = o(n)$ a.s.,
as follows, e.g., from 0, Theorem 6.3] together with the Borel-Cantelli lemma.
The quantity $r'_t = \mu^* - \mu_{I_t}$ is sometimes called instantaneous regret. It differs
from the simple regret $r_t$ and in particular, $R_n = r'_1 + \cdots + r'_n$ is in general not
equal to $r_1 + \cdots + r_n$. Theorem 1, among others, will however indicate some
connections between $r_n$ and $R_n$.
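To make the distinction between the two regret notions concrete, here is a small sketch computing both from a record of pulled and recommended arms. The helper names and the zero-based arm indexing are assumptions made for this illustration only.

```python
def gaps(means):
    """Gaps Delta_i = mu* - mu_i for each arm (arms indexed from 0 here)."""
    best = max(means)
    return [best - mu for mu in means]

def simple_regret(means, recommended_arm):
    """r_n = Delta_{J_n}: regret of the single recommended arm."""
    return gaps(means)[recommended_arm]

def cumulative_regret(means, pulled_arms):
    """R_n = sum over rounds t of Delta_{I_t}, for the sequence of pulled arms."""
    d = gaps(means)
    return sum(d[i] for i in pulled_arms)

means = [0.5, 0.3, 0.1]
pulled = [0, 1, 2, 0, 0, 1]        # I_1, ..., I_6
recommended = [0, 0, 0, 0, 0, 0]   # J_1, ..., J_6
R_n = cumulative_regret(means, pulled)
sum_simple = sum(simple_regret(means, j) for j in recommended)
```

In this toy record the forecaster keeps exploring suboptimal arms, so the cumulative regret is positive, while every recommendation is the best arm, so the simple regrets sum to zero.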

Remark 1. The setting described above is concerned with a finite number of
arms. In Section 6 we will extend it to the case of arms indexed by a general
topological space.

3. The smaller the cumulative regret, the larger the simple regret

It is immediate that for well-chosen recommendation strategies, the simple
regret can be upper bounded in terms of the cumulative regret. For instance,
the strategy that at time $n$ recommends arm $i$ with probability $T_i(n)/n$ (recall
that we allow the forecaster to use an external randomization) ensures that the
simple regret satisfies $\mathbb{E} r_n = \mathbb{E} R_n / n$. Therefore, upper bounds on $\mathbb{E} R_n$ lead to
upper bounds on $\mathbb{E} r_n$.
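The identity behind this remark holds already conditionally on the sequence of pulled arms: averaging the gaps with weights $T_i(n)/n$ gives exactly $R_n/n$. The following sketch checks this with exact rational arithmetic; the function names and the zero-based indexing are assumptions of the illustration.

```python
from collections import Counter
from fractions import Fraction

def edp_expected_simple_regret(means, pulled_arms):
    """Expected simple regret, given the pulls, when arm i is recommended
    with probability T_i(n)/n."""
    n = len(pulled_arms)
    best = max(means)
    counts = Counter(pulled_arms)
    return sum(Fraction(counts[i], n) * (best - means[i]) for i in range(len(means)))

def average_cumulative_regret(means, pulled_arms):
    """R_n / n for the same sequence of pulls."""
    best = max(means)
    return sum(best - means[i] for i in pulled_arms) / len(pulled_arms)

means = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 10)]
pulled = [0, 1, 1, 2, 0, 0]
lhs = edp_expected_simple_regret(means, pulled)
rhs = average_cumulative_regret(means, pulled)
```

Taking expectations over the pulls on both sides then yields the equality stated in the text.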

We show here that, conversely, upper bounds on $\mathbb{E} R_n$ also lead to lower
bounds on $\mathbb{E} r_n$: the smaller the guaranteed upper bound on $\mathbb{E} R_n$, the larger the
lower bound on $\mathbb{E} r_n$, no matter what the recommendation strategy is.

This is interpreted as a variation of the "classical" trade-off between
exploration and exploitation. Here, while the recommendation strategy $(J_n)$ relies
only on the exploitation of the results of the preliminary exploration phase, the
design of the allocation strategy $(I_t)$ consists in an efficient exploration of the
arms. To guarantee this efficient exploration, past payoffs of the arms have
to be considered and thus, even in the exploration phase, some exploitation is
needed. Theorem 1 and its corollaries aim at quantifying the needed respective
amount of exploration and exploitation. In particular, to have an asymptotically
optimal rate of decrease for the simple regret, each arm should be sampled a
linear number of times, while for the cumulative regret, it is known that the
forecaster should not do so more than a logarithmic number of times on the
suboptimal arms.

Formally, our main result is reported below in Theorem 1. It is strong in
the sense that it lower bounds the simple regret of any forecaster for all possible
sets of Bernoulli distributions $\{\nu_1, \ldots, \nu_K\}$ over the rewards with parameters
that are all distinct (no two parameters can be equal) and all different from 1.
Note however that in particular these conditions entail that there is a unique
best arm.

Theorem 1 (Main result). For any forecaster (i.e., for any pair of allocation
and recommendation strategies) and any function $\varepsilon : \{1, 2, \ldots\} \to \mathbb{R}$ such that

for all (Bernoulli) distributions $\nu_1, \ldots, \nu_K$ on the rewards, there exists a
constant $C \geq 0$ with $\mathbb{E} R_n \leq C\,\varepsilon(n)$,

the following holds true:

for all sets of $K \geq 3$ Bernoulli distributions on the rewards, with parameters
that are all distinct and all different from 1, there exists a constant
$D \geq 0$ and an ordering $\nu_1, \ldots, \nu_K$ of the considered distributions such that

\[ \mathbb{E} r_n \geq \frac{\Delta}{2}\, e^{-D \varepsilon(n)} . \]

We insist on the fact that only sets, that is, unordered collections, of
distributions are considered in the second part of the statement of the theorem.
Put differently, we merely show therein that for each ordered $K$-tuple of
distributions that are as indicated above, there exists a reordering that leads to
the stated lower bound on the simple regret. This is the best result that can
be achieved. Indeed, some forecasters are sensitive to the ordering of the
distributions and might get a zero regret for a significant fraction of the ordered
$K$-tuples simply because, e.g., their strategy is to constantly pull a given arm,
which is sometimes the optimal strategy just by chance. To get lower bounds
in all cases we must therefore allow reorderings of $K$-tuples (or, equivalently,
orderings of sets).

Corollary 1 (General distribution-dependent lower bound). For any
forecaster, and any set of $K \geq 3$ Bernoulli distributions on the rewards, with
parameters that are all distinct and all different from 1, there exist two constants
$\beta > 0$ and $\gamma \geq 0$ and an ordering of the considered distributions such that

\[ \mathbb{E} r_n \geq \beta\, e^{-\gamma n} . \]

Theorem 1 is proved below and Corollary 1 follows from the fact that the
cumulative regret is always bounded by $n$. To better grasp the point of the
theorem, one should keep in mind that the typical (distribution-dependent) rate
of growth of the cumulative regret of good algorithms, e.g., UCB1, is
$\varepsilon(n) = \ln n$. This, as asserted in [HI, is the optimal rate. Hence a recommendation
strategy based on such an allocation strategy is bound to suffer a simple regret
that decreases at best polynomially fast. We state this result for the slight
modification UCB($\alpha$) of UCB1 stated in Figure 2 and introduced in [l[; its
proof relies on noting that it achieves a cumulative regret bounded by a large
enough distribution-dependent constant times $\varepsilon(n) = \alpha \ln n$.

Corollary 2 (Distribution-dependent lower bound for UCB($\alpha$)). The
allocation strategy $(I_t)$ given by the forecaster UCB($\alpha$) of Figure 2 ensures that for
any recommendation strategy $(J_t)$ and all sets of $K \geq 3$ Bernoulli distributions
on the rewards, with parameters that are all distinct and all different from 1,
there exist two constants $\beta > 0$ and $\gamma \geq 0$ (independent of $\alpha$) and an ordering
of the considered distributions such that

\[ \mathbb{E} r_n \geq \beta\, n^{-\gamma \alpha} . \]
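As a worked step (a sketch of the substitution, not a replacement for the proof), plugging $\varepsilon(n) = \alpha \ln n$ into the lower bound of Theorem 1 shows where the polynomial rate comes from. Reading off $\beta = \Delta/2$ and $\gamma = D$ is our identification; that $\gamma$ can be taken independent of $\alpha$ rests on the form of the cumulative-regret bound for UCB($\alpha$), which is not reproduced here.

```latex
\[
  \mathbb{E} r_n \;\geq\; \frac{\Delta}{2}\, e^{-D \varepsilon(n)}
  \;=\; \frac{\Delta}{2}\, e^{-D \alpha \ln n}
  \;=\; \frac{\Delta}{2}\, n^{-D \alpha} .
\]
```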

Proof. The intuitive version of the proof of Theorem 1 is as follows. The basic
idea is to consider a tie case when the best and worst arms have zero empirical 
means; it happens often enough (with a probability at least exponential in the 
number of times we pulled these arms) and results in the forecaster basically 
having to pick another arm and suffering some regret. Permutations are used to 
control the case of untypical or naive forecasters that would despite all pull an 
arm with zero empirical mean, since they force a situation when those forecasters 
choose the worst arm instead of the best one. 

Formally, we fix the forecaster (a pair of allocation and recommendation
strategies) and a corresponding function $\varepsilon$ such that the assumption of the
theorem is satisfied. We denote by $p_n = (p_{1,n}, \ldots, p_{K,n})$ the probability
distribution from which $J_n$ is drawn at random thanks to an auxiliary distribution.
Note that $p_n$ is a random vector which depends on $I_1, \ldots, I_n$ as well as on
the obtained rewards $Y_1, \ldots, Y_n$. We consider below a set of $K \geq 3$ distinct
Bernoulli distributions, satisfying the conditions of the theorem; actually, we
only use below that their parameters are (up to a first ordering) such that
$1 > \mu_1 > \mu_2 \geq \mu_3 \geq \cdots \geq \mu_K \geq 0$ and $\mu_2 > \mu_K$ (thus, $\mu_2 > 0$).

Step 0 introduces another layer of notation. The latter depends on
permutations $\sigma$ of $\{1, \ldots, K\}$. To have a gentle start, we first describe the notation
when the permutation is the identity, $\sigma = \mathrm{id}$. We denote by $\mathbb{P}$ and $\mathbb{E}$ the
probability and expectation with respect to the original $K$-tuple $\nu_1, \ldots, \nu_K$ of
distributions over the arms. For $i = 1$ (respectively, $i = K$), we denote by $\mathbb{P}_{i,\mathrm{id}}$
and $\mathbb{E}_{i,\mathrm{id}}$ the probability and expectation with respect to the $K$-tuples formed
by $\delta_0, \nu_2, \ldots, \nu_K$ (respectively, $\delta_0, \nu_2, \ldots, \nu_{K-1}, \delta_0$), where $\delta_0$ denotes the Dirac
measure on 0.

For a given permutation $\sigma$, we consider a similar notation up to a reordering,
as follows. The symbols $\mathbb{P}_\sigma$ and $\mathbb{E}_\sigma$ refer to the probability and expectation
with respect to the $K$-tuple of distributions over the arms formed by the
$\nu_{\sigma^{-1}(1)}, \ldots, \nu_{\sigma^{-1}(K)}$. Note in particular that the $i$-th best arm is located in the
$\sigma(i)$-th position. Now, we denote for $i = 1$ (respectively, $i = K$) by $\mathbb{P}_{i,\sigma}$ and
$\mathbb{E}_{i,\sigma}$ the probability and expectation with respect to the $K$-tuple formed by the
$\nu_{\sigma^{-1}(j)}$, except that we replaced the best of them, located in the $\sigma(1)$-th
position, by a Dirac measure on 0 (respectively, the best and worst of them, located
in the $\sigma(1)$-th and $\sigma(K)$-th positions, by Dirac measures on 0). We provide
now a proof in six steps.

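Step 1 lower bounds the quantity of interest by an average of the simple
regrets obtained by reordering,

\[ \max_\sigma \mathbb{E}_\sigma r_n \;\geq\; \frac{1}{K!} \sum_\sigma \mathbb{E}_\sigma r_n \;\geq\; \frac{\mu_1 - \mu_2}{K!} \sum_\sigma \mathbb{E}_\sigma \bigl[ 1 - p_{\sigma(1),n} \bigr] , \]

where we used that under $\mathbb{P}_\sigma$, the index of the best arm is $\sigma(1)$ and the minimal
regret for playing any other arm is at least $\mu_1 - \mu_2$.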

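Step 2 rewrites each term of the sum over $\sigma$ as the product of three simple
terms. We use first that $\mathbb{P}_{1,\sigma}$ is the same as $\mathbb{P}_\sigma$, except that it ensures that arm
$\sigma(1)$ has zero reward throughout. Denoting by

\[ C_{i,n} = \sum_{s=1}^{T_i(n)} X_{i,s} \]

the cumulative reward of the $i$-th arm till round $n$, one then gets

\[
\begin{aligned}
\mathbb{E}_\sigma \bigl[ 1 - p_{\sigma(1),n} \bigr]
  &\geq \mathbb{E}_\sigma \bigl[ 1 - p_{\sigma(1),n} \,\big|\, C_{\sigma(1),n} = 0 \bigr] \times \mathbb{P}_\sigma \bigl\{ C_{\sigma(1),n} = 0 \bigr\} \\
  &= \mathbb{E}_{1,\sigma} \bigl[ 1 - p_{\sigma(1),n} \bigr] \, \mathbb{P}_\sigma \bigl\{ C_{\sigma(1),n} = 0 \bigr\} .
\end{aligned}
\]

Second, repeating the argument from $\mathbb{P}_{1,\sigma}$ to $\mathbb{P}_{K,\sigma}$,

\[
\begin{aligned}
\mathbb{E}_{1,\sigma} \bigl[ 1 - p_{\sigma(1),n} \bigr]
  &\geq \mathbb{E}_{1,\sigma} \bigl[ 1 - p_{\sigma(1),n} \,\big|\, C_{\sigma(K),n} = 0 \bigr] \, \mathbb{P}_{1,\sigma} \bigl\{ C_{\sigma(K),n} = 0 \bigr\} \\
  &= \mathbb{E}_{K,\sigma} \bigl[ 1 - p_{\sigma(1),n} \bigr] \, \mathbb{P}_{1,\sigma} \bigl\{ C_{\sigma(K),n} = 0 \bigr\}
\end{aligned}
\]

and therefore,

\[ \mathbb{E}_\sigma \bigl[ 1 - p_{\sigma(1),n} \bigr] \;\geq\; \mathbb{E}_{K,\sigma} \bigl[ 1 - p_{\sigma(1),n} \bigr] \, \mathbb{P}_{1,\sigma} \bigl\{ C_{\sigma(K),n} = 0 \bigr\} \, \mathbb{P}_\sigma \bigl\{ C_{\sigma(1),n} = 0 \bigr\} . \tag{1} \]
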
Step 3 deals with the second term in the right-hand side of (1),

\[ \mathbb{P}_{1,\sigma} \bigl\{ C_{\sigma(K),n} = 0 \bigr\} = \mathbb{E}_{1,\sigma} \Bigl[ (1 - \mu_K)^{T_{\sigma(K)}(n)} \Bigr] \;\geq\; (1 - \mu_K)^{\mathbb{E}_{1,\sigma} T_{\sigma(K)}(n)} , \]

where the equality can be seen by conditioning on $I_1, \ldots, I_n$ and then taking
the expectation, whereas the inequality is a consequence of Jensen's inequality.
Now, the expected number of times the suboptimal arm $\sigma(K)$ is pulled under
$\mathbb{P}_{1,\sigma}$ (for which $\sigma(2)$ is the optimal arm) is bounded by the regret, by the very
definition of the latter: $(\mu_2 - \mu_K)\, \mathbb{E}_{1,\sigma} T_{\sigma(K)}(n) \leq \mathbb{E}_{1,\sigma} R_n$. By hypothesis, there
exists a constant $C$ such that for all $\sigma$, $\mathbb{E}_{1,\sigma} R_n \leq C\,\varepsilon(n)$; the constant $C$ in
the hypothesis of the theorem depends on the (order of the) distributions but
this can be circumvented by taking the maximum of $K!$ values to get the previous
statement. We finally get

\[ \mathbb{P}_{1,\sigma} \bigl\{ C_{\sigma(K),n} = 0 \bigr\} \;\geq\; (1 - \mu_K)^{C \varepsilon(n) / (\mu_2 - \mu_K)} . \]

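The Jensen step used here (and again in Step 4) relies on the convexity of $t \mapsto (1-\mu)^t$. A small numerical sketch, with a hypothetical two-point distribution for the number of pulls, illustrates the direction of the inequality; the function name and the numbers are assumptions of the illustration.

```python
def jensen_sides(mu, pull_counts, probs):
    """Return (E[(1 - mu)^T], (1 - mu)^{E[T]}) for a discrete random number
    of pulls T taking value pull_counts[k] with probability probs[k]."""
    lhs = sum(p * (1 - mu) ** t for t, p in zip(pull_counts, probs))
    mean_pulls = sum(p * t for t, p in zip(pull_counts, probs))
    rhs = (1 - mu) ** mean_pulls
    return lhs, rhs

# The arm is either never pulled or pulled 10 times, with equal probability.
lhs, rhs = jensen_sides(0.3, [0, 10], [0.5, 0.5])
```

The left-hand side is dominated by the event that the arm is rarely pulled, which is why a forecaster with small cumulative regret (few pulls of suboptimal arms) leaves a non-negligible probability of seeing only zero rewards.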
Step 4 lower bounds the third term in the right-hand side of (1) as

\[ \mathbb{P}_\sigma \bigl\{ C_{\sigma(1),n} = 0 \bigr\} \;\geq\; (1 - \mu_1)^{C \varepsilon(n) / \mu_2} . \]

We denote by $W_n = (I_1, Y_1, \ldots, I_n, Y_n)$ the history of pulled arms and obtained
payoffs up to time $n$. What follows is reminiscent of the techniques used in [17].
We are interested in certain realizations $w_n = (i_1, y_1, \ldots, i_n, y_n)$ of the history:
we consider the subset $\mathcal{H}$ formed by the elements $w_n$ such that whenever $\sigma(1)$
was played, it got a null reward, that is, such that $y_t = 0$ for all indexes $t$ with
$i_t = \sigma(1)$. For all arms $j$, we then denote by $t_j(w_n)$ the realization of $T_j(n)$
corresponding to $w_n$. Since the likelihood of an element $w_n \in \mathcal{H}$ under $\mathbb{P}_\sigma$ is
$(1 - \mu_1)^{t_{\sigma(1)}(w_n)}$ times the one under $\mathbb{P}_{1,\sigma}$, we get

\[
\begin{aligned}
\mathbb{P}_\sigma \bigl\{ C_{\sigma(1),n} = 0 \bigr\}
  &= \sum_{w_n \in \mathcal{H}} \mathbb{P}_\sigma \{ W_n = w_n \} \\
  &= \sum_{w_n \in \mathcal{H}} (1 - \mu_1)^{t_{\sigma(1)}(w_n)} \, \mathbb{P}_{1,\sigma} \{ W_n = w_n \}
   = \mathbb{E}_{1,\sigma} \Bigl[ (1 - \mu_1)^{T_{\sigma(1)}(n)} \Bigr] .
\end{aligned}
\]

The argument is concluded as before, first by Jensen's inequality and then,
by using that $\mu_2\, \mathbb{E}_{1,\sigma} T_{\sigma(1)}(n) \leq \mathbb{E}_{1,\sigma} R_n \leq C\,\varepsilon(n)$ by definition of the regret and
the hypothesis put on its control.

Step 5 resorts to a symmetry argument to show that as far as the first term
of the right-hand side of (1) is concerned,

\[ \sum_\sigma \mathbb{E}_{K,\sigma} \bigl[ 1 - p_{\sigma(1),n} \bigr] \;\geq\; \frac{K!}{2} . \]

Since $\mathbb{P}_{K,\sigma}$ only depends on $\sigma(2), \ldots, \sigma(K-1)$, we denote by $\mathbb{P}^{\sigma(2),\ldots,\sigma(K-1)}$
the common value of these probability distributions when $\sigma(1)$ and $\sigma(K)$ vary
(and a similar notation for the associated expectation). We can thus group
the permutations $\sigma$ two by two according to these $(K-2)$-tuples, one of the
two permutations being defined by $\sigma(1)$ equal to one of the two elements of
$\{1, \ldots, K\}$ not present in the $(K-2)$-tuple, and the other one being such that
$\sigma(1)$ equals the other such element. Formally,

\[
\begin{aligned}
\sum_\sigma \mathbb{E}_{K,\sigma}\, p_{\sigma(1),n}
  &= \sum_{j_2, \ldots, j_{K-1}} \mathbb{E}^{j_2, \ldots, j_{K-1}} \Biggl[ \sum_{j \in \{1, \ldots, K\} \setminus \{j_2, \ldots, j_{K-1}\}} p_{j,n} \Biggr] \\
  &\leq \sum_{j_2, \ldots, j_{K-1}} \mathbb{E}^{j_2, \ldots, j_{K-1}} \bigl[ 1 \bigr] = \frac{K!}{2} ,
\end{aligned}
\]

where the summations over $j_2, \ldots, j_{K-1}$ are over all possible $(K-2)$-tuples of
distinct elements in $\{1, \ldots, K\}$.

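Step 6 simply puts all pieces together and lower bounds $\max_\sigma \mathbb{E}_\sigma r_n$ by

\[
\begin{aligned}
& \frac{\mu_1 - \mu_2}{K!} \sum_\sigma \mathbb{E}_{K,\sigma} \bigl[ 1 - p_{\sigma(1),n} \bigr] \, \mathbb{P}_\sigma \bigl\{ C_{\sigma(1),n} = 0 \bigr\} \, \mathbb{P}_{1,\sigma} \bigl\{ C_{\sigma(K),n} = 0 \bigr\} \\
& \qquad \geq \frac{\mu_1 - \mu_2}{2} \Bigl( (1 - \mu_K)^{C / (\mu_2 - \mu_K)} \, (1 - \mu_1)^{C / \mu_2} \Bigr)^{\varepsilon(n)} .
\end{aligned}
\]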
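As a reading aid (our own rewriting of the final bound, under the standing assumptions $1 > \mu_1 > \mu_2 \geq \mu_K \geq 0$ and $\mu_2 > \mu_K$), the constant $D$ of Theorem 1 can be read off by taking logarithms, and the prefactor is $\Delta/2$ since the minimal gap is $\Delta = \mu_1 - \mu_2$:

```latex
\[
  \Bigl( (1-\mu_K)^{C/(\mu_2-\mu_K)} \, (1-\mu_1)^{C/\mu_2} \Bigr)^{\varepsilon(n)}
  = e^{-D \varepsilon(n)} ,
  \qquad
  D = C \left( \frac{1}{\mu_2 - \mu_K} \ln \frac{1}{1-\mu_K}
             + \frac{1}{\mu_2} \ln \frac{1}{1-\mu_1} \right) \;\geq\; 0 .
\]
```

This also makes visible why the theorem requires parameters different from 1 (so that $\ln \frac{1}{1-\mu_1}$ is finite) and distinct parameters (so that $\mu_2 - \mu_K > 0$).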



4. Upper bounds on the simple regret 

In this section, we aim at qualifying the implications of Theorem 1 by
pointing out that it should be interpreted as a result for large n only. For moderate
values of n, strategies not pulling each arm a linear number of times in the
exploration phase can have a smaller simple regret. To do so, we consider only two
natural and well-used allocation strategies since the aim of this paper is mostly
to study the links between the cumulative and simple regret and not really to
prove the best possible bounds on the simple regret. More sophisticated
allocation strategies were considered recently in 0] and they can be used to improve
on the upper bounds on the simple regret presented below.

The first allocation strategy is the uniform allocation, which we use as a
simple benchmark; it pulls each arm a linear number of times (see Figure 2
for its formal description). The second one is UCB($\alpha$) (a variant of UCB1
introduced in [l[ using an exploration rate parameter $\alpha > 1$ and described also in
Figure 2). It is designed for the classical exploration-exploitation dilemma (i.e.,
it minimizes the cumulative regret) and pulls suboptimal arms a logarithmic
number of times only.

Uniform allocation (Unif): plays all arms one after the other
For each round $t = 1, 2, \ldots$,

pull $I_t = [t \bmod K]$, where $[t \bmod K]$ denotes the value of $t$ modulo $K$.

UCB($\alpha$): plays at each round the arm with the highest upper confidence bound
Parameter: exploration factor $\alpha > 1$
For each round $t = 1, 2, \ldots$,

(1) for each $i \in \{1, \ldots, K\}$, if $T_i(t-1) = 0$ let $B_{i,t} = +\infty$; otherwise, let

\[ B_{i,t} = \widehat{\mu}_{i,t-1} + \sqrt{\frac{\alpha \ln t}{T_i(t-1)}} \qquad \text{where} \qquad \widehat{\mu}_{i,t-1} = \frac{1}{T_i(t-1)} \sum_{s=1}^{T_i(t-1)} X_{i,s} ; \]

(2) pull $I_t \in \operatorname*{argmax}_{i=1,\ldots,K} B_{i,t}$
(ties broken by choosing, for instance, the arm with smallest index).

Figure 2: Two allocation strategies.
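The two allocation strategies of Figure 2 can be sketched in a few lines. This is a minimal illustration assuming zero-based arm indices and a `history` list holding the rewards observed so far for each arm; the function names are hypothetical.

```python
import math

def unif_allocation(t, K):
    """Uniform allocation: arm pulled at round t = 1, 2, ... (arms 0..K-1)."""
    return t % K

def ucb_alpha_allocation(t, history, alpha):
    """UCB(alpha): arm with the highest upper confidence bound at round t.

    history[i] lists the rewards X_{i,1}, ..., X_{i,T_i(t-1)} of arm i.
    """
    best_arm, best_bound = 0, -math.inf
    for i, rewards in enumerate(history):
        pulls = len(rewards)
        if pulls == 0:
            bound = math.inf                 # unpulled arms get B_{i,t} = +infinity
        else:
            empirical_mean = sum(rewards) / pulls
            bound = empirical_mean + math.sqrt(alpha * math.log(t) / pulls)
        if bound > best_bound:               # strict: ties go to the smallest index
            best_arm, best_bound = i, bound
    return best_arm
```

The strict comparison implements the tie-breaking rule suggested in the figure; any other rule would fit the description equally well.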

In addition to these allocation strategies we consider three recommendation 
strategies, the ones that recommend respectively the empirical distribution of 
plays, the empirical best arm, or the most played arm. They are formally defined
in Figure 3.
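A minimal sketch of these three recommendation rules follows. It assumes the same `history` representation as a list of per-arm reward lists with zero-based indices; the function names are hypothetical, and ties are broken in favor of the smallest index, which is one admissible choice among others.

```python
import random

def recommend_edp(history, rng=random):
    """EDP: draw arm i with probability T_i(n)/n."""
    counts = [len(rewards) for rewards in history]
    return rng.choices(range(len(history)), weights=counts, k=1)[0]

def recommend_eba(history):
    """EBA: among arms pulled at least once, highest empirical mean."""
    pulled = [i for i, rewards in enumerate(history) if rewards]
    return max(pulled, key=lambda i: sum(history[i]) / len(history[i]))

def recommend_mpa(history):
    """MPA: the most played arm."""
    return max(range(len(history)), key=lambda i: len(history[i]))

history = [[1, 0, 1], [1], []]
eba_choice = recommend_eba(history)   # arm 1 has the highest empirical mean
mpa_choice = recommend_mpa(history)   # arm 0 was played most often
```

On this toy history EBA and MPA disagree, which is the kind of situation where the choice of the recommendation strategy matters.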

Table 1 summarizes the distribution-dependent and distribution-free bounds
we could prove in this paper (the difference between the two families of bounds
is whether the constants in the bounds can depend or not on the unknown
distributions $\nu_i$). It shows that two interesting couples of strategies are, on the
one hand, the uniform allocation together with the choice of the empirical best
arm, and on the other hand, UCB($\alpha$) together with the choice of the most played
arm. The first pair was perhaps expected, the second one might be considered
more surprising.

Table 1 also indicates that while for distribution-dependent bounds, the
asymptotically optimal rate of decrease for the simple regret in the number $n$ of
rounds is exponential, for distribution-free bounds, this rate worsens to $1/\sqrt{n}$.
A similar situation arises for the cumulative regret, see [15] (optimal $\ln n$ rate for
distribution-dependent bounds) versus [3] (optimal $\sqrt{n}$ rate for distribution-free
bounds).

Parameters: the history $I_1, \ldots, I_n$ of played actions and of their associated rewards
$Y_1, \ldots, Y_n$, grouped according to the arms as $X_{i,1}, \ldots, X_{i,T_i(n)}$, for $i = 1, \ldots, K$

Empirical distribution of plays (EDP)

Recommends arm $i$ with probability $T_i(n)/n$, that is, draws $J_n$ at random
according to

\[ p_n = \left( \frac{T_1(n)}{n}, \ldots, \frac{T_K(n)}{n} \right) . \]

Empirical best arm (EBA)

Only considers arms $i$ with $T_i(n) \geq 1$, computes their associated empirical means

\[ \widehat{\mu}_{i,n} = \frac{1}{T_i(n)} \sum_{s=1}^{T_i(n)} X_{i,s} , \]

and forms the recommendation

\[ J_n \in \operatorname*{argmax}_{i=1,\ldots,K} \widehat{\mu}_{i,n} \]

(ties broken in some way).

Most played arm (MPA)

Recommends the most played arm,

\[ J_n \in \operatorname*{argmax}_{i=1,\ldots,K} T_i(n) \]

(ties broken in some way).

Figure 3: Three recommendation strategies.

Remark 2. The distribution-free lower bound in Table 1 follows from a
straightforward adaptation of the proof of the lower bound on the cumulative regret in
[H; one can prove that, for $n \geq K \geq 2$,

inf supEr,, > — 

where the infimum is taken over all forecasters while the supremum considers all
sets of $K$ distributions over [0, 1]. (The proof uses exactly the same reduction to
a stochastic setting as in that reference. It is even simpler than there
since here, only what happens at round $n$ based on the information provided
by previous rounds is to be considered; in the cumulative case considered there,
such an analysis had to be made at each round $t \leq n$.)

Table 1: Distribution-dependent (top) and distribution-free (bottom) upper bounds on the
expected simple regret of the considered pairs of allocation (rows) and recommendation
(columns) strategies. Lower bounds are also indicated. The □ symbols denote the universal
constants, whereas the ○ are distribution-dependent constants. In parentheses, we provide
the reference within this paper (index of the proposition, theorem, remark, corollary) where
the stated bound is proved.

4.1. A simple benchmark: the uniform allocation strategy

As explained above, the combination of the uniform allocation with the
recommendation indicating the empirical best arm forms an important theoretical
benchmark. This section studies briefly its theoretical properties: the rate of
decrease of its simple regret is exponential in a distribution-dependent sense and
equals the optimal (up to a logarithmic term) $1/\sqrt{n}$ rate in the distribution-free
case.

Below, we mean by the recommendation given by the empirical best arm at
round $K \lfloor n/K \rfloor$ the recommendation $J_{K \lfloor n/K \rfloor}$ of EBA (see Figure 3), where $\lfloor x \rfloor$
denotes the lower integer part of a real number $x$. The reason why at round $n$
we prefer $J_{K \lfloor n/K \rfloor}$ to $J_n$ is only technical. The analysis is indeed simpler when
all averages over the rewards obtained by each arm are over the same number
of terms. This happens at rounds $n$ multiple of $K$ and this is why we prefer
taking the recommendation of round $K \lfloor n/K \rfloor$ instead of the one of round $n$.

We first propose two distribution-dependent bounds; the first one is sharper
in the case when there are few arms, while the second one is suited for large $K$.

Proposition 1 (Distribution-dependent; Unif and EBA). The uniform
allocation strategy associated with the recommendation given by the empirical best
arm (at round $K \lfloor n/K \rfloor$) ensures that

\[ \mathbb{E} r_n \;\leq\; \sum_{i : \Delta_i > 0} \Delta_i \, e^{-\Delta_i^2 \lfloor n/K \rfloor} \qquad \text{for all } n \geq K ; \]

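and also, for all $\eta \in (0, 1)$ and all $n \geq \max \bigl\{ K, \; K \ln K / (\eta^2 \Delta^2) \bigr\}$,

\[ \mathbb{E} r_n \;\leq\; \Bigl( \max_{i=1,\ldots,K} \Delta_i \Bigr) \exp \left( - \frac{(1-\eta)^2}{2} \left\lfloor \frac{n}{K} \right\rfloor \Delta^2 \right) . \]
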
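Proof. To prove the first inequality, we relate the simple regret to the
probability of choosing a non-optimal arm,

\[ \mathbb{E} r_n = \mathbb{E} \Delta_{J_n} = \sum_{i : \Delta_i > 0} \Delta_i \, \mathbb{P} \{ J_n = i \} \;\leq\; \sum_{i : \Delta_i > 0} \Delta_i \, \mathbb{P} \bigl\{ \widehat{\mu}_{i,n} \geq \widehat{\mu}_{i^*,n} \bigr\} , \]

where the upper bound follows from the fact that to be the empirical best arm,
an arm $i$ must have performed, in particular, better than a best arm $i^*$. We
now apply Hoeffding's inequality for independent bounded random variables.
The quantities $\widehat{\mu}_{i,n} - \widehat{\mu}_{i^*,n}$ are given by a (normalized) sum of $2 \lfloor n/K \rfloor$
random variables taking values in [0, 1] or in [−1, 0] and have expectation $-\Delta_i$.
Thus, the probability of interest is bounded by

\[
\begin{aligned}
\mathbb{P} \bigl\{ \widehat{\mu}_{i,n} - \widehat{\mu}_{i^*,n} \geq 0 \bigr\}
  &= \mathbb{P} \bigl\{ ( \widehat{\mu}_{i,n} - \widehat{\mu}_{i^*,n} ) - ( -\Delta_i ) \geq \Delta_i \bigr\} \\
  &\leq \exp \left( - \frac{2 \bigl( \lfloor n/K \rfloor \Delta_i \bigr)^2}{2 \lfloor n/K \rfloor} \right)
   = \exp \left( - \left\lfloor \frac{n}{K} \right\rfloor \Delta_i^2 \right) ,
\end{aligned}
\]

which yields the first result.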

The second inequality is proved by resorting to a sharper concentration
argument, namely, the method of bounded differences, see [ll], see also
[1, Chapter 2]. The complete proof can be found in Appendix A.1.
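As a numerical illustration of the first bound, the following minimal sketch evaluates $\sum_{i : \Delta_i > 0} \Delta_i e^{-\Delta_i^2 \lfloor n/K \rfloor}$ and compares it with a Monte-Carlo estimate of the simple regret of the uniform allocation combined with the empirical best arm. The Bernoulli reward model, the particular means, the tie-breaking rule, and the function names are assumptions made for the example only.

```python
import math
import random


def prop1_first_bound(gaps, n):
    """Evaluate sum over suboptimal arms of gap * exp(-gap^2 * floor(n / K))."""
    k = len(gaps)
    m = n // k  # number of pulls of each arm under the uniform allocation
    return sum(g * math.exp(-g * g * m) for g in gaps if g > 0)


def simulate_unif_eba(means, n, runs, seed=0):
    """Monte-Carlo estimate of the expected simple regret (Bernoulli rewards assumed)."""
    rng = random.Random(seed)
    k = len(means)
    m = n // k
    best = max(means)
    total = 0.0
    for _ in range(runs):
        # Each arm is pulled m times; only its number of successes matters.
        sums = [sum(rng.random() < mu for _ in range(m)) for mu in means]
        j = max(range(k), key=lambda i: sums[i])  # ties go to the lowest index
        total += best - means[j]
    return total / runs


means = [0.4, 0.5, 0.3]  # hypothetical problem instance
gaps = [max(means) - mu for mu in means]
bound = prop1_first_bound(gaps, n=300)
estimate = simulate_unif_eba(means, n=300, runs=2000)
print(f"bound = {bound:.4f}, Monte-Carlo estimate = {estimate:.4f}")
```

On such an instance the estimate is expected to sit below the bound, which is only an upper bound and need not be tight.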

The distribution-free bound of Corollary 3 is obtained not directly as a
corollary of Proposition 1 but as a consequence of its proof. (It is not enough to
optimize the bound of Proposition 1 over the $\Delta_i$, for it would yield an additional
multiplicative factor of $K$.)

Corollary 3 (Distribution-free; Unif and EBA). The uniform allocation
strategy associated with the recommendation given by the empirical best arm
(at round $K \lfloor n/K \rfloor$) ensures that

\[
\sup_{\nu_1, \ldots, \nu_K} \mathbb{E} r_n \le 2 \sqrt{ \frac{\ln K}{\lfloor n/K \rfloor} } ,
\]

where the supremum is over all $K$-tuples $(\nu_1, \ldots, \nu_K)$ of distributions over $[0, 1]$.
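Proof. We extract from the proof of Proposition 1 that

\[
\mathbb{P}\{ J_n = i \} \le \exp\bigl( - \Delta_i^2 \lfloor n/K \rfloor \bigr) ;
\]

we now distinguish whether a given $\Delta_i$ is more or less than a threshold $\varepsilon$, use
that $\sum_i \mathbb{P}\{ J_n = i \} = 1$ and $\Delta_i \le 1$ for all $i$, to write

\begin{align}
\mathbb{E} r_n = \sum_{i=1}^{K} \Delta_i \, \mathbb{P}\{ J_n = i \}
& \le \varepsilon + \sum_{i : \Delta_i > \varepsilon} \Delta_i \, \mathbb{P}\{ J_n = i \} \tag{2} \\
& \le \varepsilon + \sum_{i : \Delta_i > \varepsilon} \Delta_i \exp\bigl( - \Delta_i^2 \lfloor n/K \rfloor \bigr) . \notag
\end{align}

A simple study shows that the function $x \in [0, 1] \mapsto x \exp(-C x^2)$ is decreasing
on $[1/\sqrt{2C}, 1]$, for any $C > 0$. Therefore, taking $C = \lfloor n/K \rfloor$, we get that
whenever $\varepsilon \ge 1/\sqrt{2 \lfloor n/K \rfloor}$,

\[
\mathbb{E} r_n \le \varepsilon + (K - 1) \, \varepsilon \exp\bigl( - \varepsilon^2 \lfloor n/K \rfloor \bigr) .
\]

Substituting $\varepsilon = \sqrt{ (\ln K) / \lfloor n/K \rfloor }$ concludes the proof: for this
choice, $\exp( - \varepsilon^2 \lfloor n/K \rfloor ) = 1/K$, so that the right-hand side is at
most $2 \varepsilon$.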
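The last step of this proof can be checked numerically. The sketch below, with hypothetical helper names, computes the threshold $\varepsilon = \sqrt{(\ln K)/\lfloor n/K \rfloor}$, verifies that it lies in the range where the monotonicity argument applies, and compares $\varepsilon + (K-1)\,\varepsilon \exp(-\varepsilon^2 \lfloor n/K \rfloor)$ with $2\varepsilon$.

```python
import math


def corollary3_terms(n, k):
    """Return (eps, middle, final) for the substitution step.

    eps    = sqrt(ln K / floor(n/K))
    middle = eps + (K - 1) * eps * exp(-eps^2 * floor(n/K))
    final  = 2 * eps
    """
    m = n // k
    eps = math.sqrt(math.log(k) / m)
    middle = eps + (k - 1) * eps * math.exp(-eps * eps * m)
    return eps, middle, 2 * eps


for n, k in [(100, 2), (1000, 10), (10**5, 50)]:
    eps, middle, final = corollary3_terms(n, k)
    # The monotonicity step requires eps >= 1 / sqrt(2 * floor(n/K)).
    assert eps >= 1 / math.sqrt(2 * (n // k))
    print(n, k, round(middle, 5), round(final, 5))
```

For small $n$ the resulting value may exceed 1, in which case the bound is vacuous since the simple regret never exceeds 1.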

4.2. Analysis of UCB($\alpha$) as an allocation strategy

We start by studying the recommendation given by the most played arm.
A (distribution-dependent) bound is stated in Theorem 2; the bound does not
involve any quantity depending on the $\Delta_i$, but it only holds for rounds $n$ large
enough, a statement that does involve the $\Delta_i$. Its interest is first that it is
simple to read, and second, that the techniques used to prove it imply easily a
second (distribution-free) bound, stated in Theorem 3 and which is comparable
to Corollary 3.

Theorem 2 (Distribution-dependent; UCB($\alpha$) and MPA). For $\alpha > 1$,
the allocation strategy given by UCB($\alpha$) associated with the recommendation
given by the most played arm ensures that

\[
\mathbb{E} r_n \le \frac{K}{\alpha - 1} \Bigl( \frac{n}{K} - 1 \Bigr)^{2(1 - \alpha)}
\]

for all $n$ sufficiently large, e.g., such that
$n \ge K + \dfrac{4 K \alpha \ln n}{\Delta^2}$ and $n \ge K(K + 2)$.

The polynomial rate in the upper bound above is not a coincidence according
to the lower bound exhibited in Corollary 2. Here, surprisingly enough,
this polynomial rate of decrease is distribution-free (but in compensation, the
bound is only valid after a distribution-dependent time). This rate illustrates
Theorem 1: the larger $\alpha$, the larger the (theoretical bound on the) cumulative
regret of UCB($\alpha$) but the smaller the simple regret of UCB($\alpha$) associated with
the recommendation given by the most played arm.
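To fix ideas, here is a minimal sketch of the pair of strategies studied in this subsection: an allocation by UCB($\alpha$) and a recommendation given by the most played arm. The index used, namely the empirical mean plus $\sqrt{\alpha \ln t / T_i(t-1)}$ with unplayed arms pulled first, is our assumption on the form of UCB($\alpha$) (its definition is not restated in this section), and the Bernoulli rewards and function name are illustrative.

```python
import math
import random


def ucb_alpha_mpa(means, n, alpha=2.0, seed=0):
    """Run UCB(alpha) for n rounds on Bernoulli arms; recommend the most played arm.

    Assumed index: empirical mean + sqrt(alpha * ln t / T_i(t-1)),
    and +infinity for an arm that has not been played yet.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # T_i(t-1): number of pulls of each arm so far
    sums = [0.0] * k   # cumulated rewards of each arm
    for t in range(1, n + 1):
        def index(i):
            if counts[i] == 0:
                return math.inf
            return sums[i] / counts[i] + math.sqrt(alpha * math.log(t) / counts[i])

        arm = max(range(k), key=index)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    most_played = max(range(k), key=lambda i: counts[i])
    return most_played, counts


recommended, counts = ucb_alpha_mpa([0.9, 0.1, 0.1], n=500)
print(recommended, counts)
```

With well-separated means as above, most pulls concentrate on the best arm, so that the most played arm and the best arm coincide.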

Theorem 3 (Distribution-free; UCB($\alpha$) and MPA). For $\alpha > 1$, the
allocation strategy given by UCB($\alpha$) associated with the recommendation given by
the most played arm ensures that, for all $n \ge K(K + 2)$,

\[
\sup_{\nu_1, \ldots, \nu_K} \mathbb{E} r_n
\le \sqrt{ \frac{4 K \alpha \ln n}{n - K} }
+ \frac{K}{\alpha - 1} \Bigl( \frac{n}{K} - 1 \Bigr)^{2(1 - \alpha)} ,
\]

where the supremum is over all $K$-tuples $(\nu_1, \ldots, \nu_K)$ of distributions over $[0, 1]$.
4.2.1. Proofs of Theorems 2 and 3

We start with a technical lemma from which the two theorems will follow
easily.

Lemma 1. Let $a_1, \ldots, a_K$ be real numbers such that $a_1 + \ldots + a_K = 1$ and
$a_i \ge 0$ for all $i$, with the additional property that for all suboptimal arms $i$ and
all optimal arms $i^*$, one has $a_i \le a_{i^*}$. Then for $\alpha > 1$, the allocation strategy
given by UCB($\alpha$) associated with the recommendation given by the most played
arm ensures that

\[
\mathbb{E} r_n \le \frac{1}{\alpha - 1} \sum_{i \ne i^*} ( a_i n - 1 )^{2(1 - \alpha)}
\]

for all $n$ sufficiently large, e.g., such that, for all suboptimal arms $i$,

\[
a_i n \ge 1 + \frac{4 \alpha \ln n}{\Delta_i^2} \qquad \text{and} \qquad a_i n \ge K + 2 .
\]

Proof. We first prove that whenever the most played arm $J_n$ is different from
an optimal arm $i^*$, then at least one of the suboptimal arms $i$ is such that
$T_i(n) \ge a_i n$. To do so, we argue by contraposition and assume that $T_i(n) < a_i n$ for
all suboptimal arms. Then,

\[
\Bigl( \sum_{i=1}^{K} a_i \Bigr) n = n = \sum_{i=1}^{K} T_i(n)
< \sum_{i^*} T_{i^*}(n) + \sum_{i} a_i n ,
\]

where, in the inequality, the first summation is over the optimal arms, the second
one, over the suboptimal ones. Therefore, we get

\[
\sum_{i^*} a_{i^*} n < \sum_{i^*} T_{i^*}(n)
\]

and there exists at least one optimal arm $i^*$ such that $T_{i^*}(n) > a_{i^*} n$. Since by
definition of the vector $(a_1, \ldots, a_K)$, one has $a_i \le a_{i^*}$ for all suboptimal arms,
it follows that $T_i(n) < a_i n \le a_{i^*} n < T_{i^*}(n)$ for all suboptimal arms, and the
most played arm $J_n$ is thus an optimal arm.
Thus, using that $\Delta_i \le 1$ for all $i$,

\[
\mathbb{E} r_n = \mathbb{E} \Delta_{J_n} \le \sum_{i : \Delta_i > 0} \mathbb{P}\bigl\{ T_i(n) \ge a_i n \bigr\} .
\]

A side-result extracted from [P, proof of Theorem 7], see also [0, proof of
Theorem 1], states that for all suboptimal arms $i$ and all rounds $t \ge K + 1$,

\[
\mathbb{P}\bigl\{ I_t = i \ \text{and} \ T_i(t - 1) \ge \ell \bigr\} \le 2 \, t^{1 - 2\alpha}
\qquad \text{whenever} \qquad \ell \ge \frac{4 \alpha \ln n}{\Delta_i^2} . \tag{3}
\]

We denote by $\lceil x \rceil$ the upper integer part of a real number $x$. For a suboptimal
arm $i$ and since by the assumptions on $n$ and the $a_i$, the choice $\ell = \lceil a_i n \rceil - 1$
satisfies $\ell \ge K + 1$ and $\ell \ge (4 \alpha \ln n) / \Delta_i^2$,

\begin{align}
\mathbb{P}\bigl\{ T_i(n) \ge a_i n \bigr\}
= \mathbb{P}\bigl\{ T_i(n) \ge \lceil a_i n \rceil \bigr\}
& \le \sum_{t = \lceil a_i n \rceil}^{n} \mathbb{P}\bigl\{ T_i(t - 1) = \lceil a_i n \rceil - 1 \ \text{and} \ I_t = i \bigr\} \notag \\
& \le \sum_{t = \lceil a_i n \rceil}^{n} 2 \, t^{1 - 2\alpha}
\le 2 \int_{\lceil a_i n \rceil - 1}^{\infty} v^{1 - 2\alpha} \, \mathrm{d}v
\le \frac{1}{\alpha - 1} ( a_i n - 1 )^{2(1 - \alpha)} , \tag{4}
\end{align}

where we used a union bound to pass to the sum over $t$ and (3) to bound each of its
terms. A summation over all suboptimal arms $i$ concludes the proof.
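The comparison between the sum of the terms $2 t^{1-2\alpha}$ and the corresponding integral, which closes the chain of inequalities above, can be checked numerically. The following sketch (function names are ours) uses the closed form $2 \int_{m-1}^{\infty} v^{1-2\alpha} \, \mathrm{d}v = (m-1)^{2(1-\alpha)} / (\alpha - 1)$, valid for $\alpha > 1$ and $m \ge 2$.

```python
def tail_sum(m, n, alpha):
    """Sum of 2 * t^(1 - 2 alpha) over t = m, ..., n."""
    return sum(2 * t ** (1 - 2 * alpha) for t in range(m, n + 1))


def integral_bound(m, alpha):
    """Closed form of 2 * integral of v^(1 - 2 alpha) over [m - 1, infinity)."""
    return (m - 1) ** (2 * (1 - alpha)) / (alpha - 1)


for alpha in (1.5, 2.0, 4.0):
    for m in (5, 50):
        s, b = tail_sum(m, 10_000, alpha), integral_bound(m, alpha)
        # The summand is decreasing in t, so the sum is dominated by the integral.
        assert s <= b
        print(alpha, m, s, b)
```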

Proof (of Theorem 2). It consists in applying Lemma 1 with the uniform
choice $a_i = 1/K$ and recalling that $\Delta$ is the minimum of the $\Delta_i > 0$.

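Proof (of Theorem 3). We start the proof by using that $\sum_i \mathbb{P}\{ J_n = i \} = 1$
and $\Delta_i \le 1$ for all $i$, and can thus write

\[
\mathbb{E} r_n = \mathbb{E} \Delta_{J_n} = \sum_{i=1}^{K} \Delta_i \, \mathbb{P}\{ J_n = i \}
\le \varepsilon + \sum_{i : \Delta_i > \varepsilon} \Delta_i \, \mathbb{P}\{ J_n = i \} .
\]

Since $J_n = i$ only if $T_i(n) \ge n/K$, we get

\[
\mathbb{E} r_n \le \varepsilon + \sum_{i : \Delta_i > \varepsilon} \Delta_i \, \mathbb{P}\Bigl\{ T_i(n) \ge \frac{n}{K} \Bigr\} .
\]

Applying (4) with $a_i = 1/K$ leads to

\[
\mathbb{E} r_n \le \varepsilon + \sum_{i : \Delta_i > \varepsilon} \frac{\Delta_i}{\alpha - 1} \Bigl( \frac{n}{K} - 1 \Bigr)^{2(1 - \alpha)} ,
\]

where $\varepsilon$ is chosen such that for all $\Delta_i > \varepsilon$, the condition

\[
\ell \ge n/K - 1 \ge (4 \alpha \ln n) / \Delta_i^2
\]

is satisfied ($n/K - 1 \ge K + 1$ being satisfied by the assumption on $n$ and $K$).
The conclusion thus follows from taking, for instance,

\[
\varepsilon = \sqrt{ (4 \alpha K \ln n) / (n - K) }
\]

and upper bounding all remaining $\Delta_i$ by 1.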
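The choice of the threshold can be illustrated numerically. The sketch below evaluates $\varepsilon = \sqrt{(4 \alpha K \ln n)/(n - K)}$ together with the resulting distribution-free bound $\varepsilon + \frac{K}{\alpha - 1} (n/K - 1)^{2(1-\alpha)}$, and checks that any gap above $\varepsilon$ satisfies the condition $n/K - 1 \ge (4 \alpha \ln n)/\Delta_i^2$. The function name and the numerical values are ours.

```python
import math


def theorem3_bound(n, k, alpha):
    """Return (eps, eps + tail) for alpha > 1 and n >= K(K + 2)."""
    if alpha <= 1 or n < k * (k + 2):
        raise ValueError("requires alpha > 1 and n >= K(K + 2)")
    eps = math.sqrt(4 * alpha * k * math.log(n) / (n - k))
    tail = k / (alpha - 1) * (n / k - 1) ** (2 * (1 - alpha))
    return eps, eps + tail


n, k, alpha = 10_000, 10, 2.0
eps, bound = theorem3_bound(n, k, alpha)
gap = 1.01 * eps  # any gap strictly larger than eps
assert n / k - 1 >= 4 * alpha * math.log(n) / gap**2
print(round(eps, 4), round(bound, 4))
```

At the threshold itself the condition holds with equality, since $4 \alpha \ln n / \varepsilon^2 = (n - K)/K$.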

4.2.2. Other recommendation strategies

We discuss here the combination of UCB($\alpha$) with the two other recommendation
strategies, namely, the choice of the empirical best arm and the use of
the empirical distribution of plays.
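Remark 3 (UCB($\alpha$) and EDP). We indicate in this remark from which results
the corresponding bounds of Table 1 follow. As noticed in the beginning
of Section 3, in the case of a recommendation formed by the empirical distribution
of plays, the simple regret is bounded in terms of the cumulative regret as
$\mathbb{E} r_n \le \mathbb{E} R_n / n$. Now, known results indicate that the cumulative regret of
UCB($\alpha$) is less than something of the form

\[
\bigcirc \, \alpha \ln n + \frac{3K}{2} + \frac{K}{2(\alpha - 1)} ,
\]

where $\bigcirc$ denotes a constant dependent on $\nu_1, \ldots, \nu_K$. The distribution-free
bound on $\mathbb{E} R_n$ (and thus on $\mathbb{E} r_n$) follows from the control, yielded by (3) and
a summation,

\[
\mathbb{E} T_i(n) \le \frac{4 \alpha \ln n}{\Delta_i^2} + \frac{3}{2} + \frac{1}{2(\alpha - 1)} ,
\]

together with the concavity argument

\begin{align*}
\mathbb{E} R_n = \sum_{i : \Delta_i > 0} \Delta_i \, \mathbb{E} T_i(n)
& = \sum_{i : \Delta_i > 0} \Bigl( \Delta_i \sqrt{\mathbb{E} T_i(n)} \Bigr) \sqrt{\mathbb{E} T_i(n)} \\
& \le \sqrt{ 4 \alpha \ln n + \frac{3}{2} + \frac{1}{2(\alpha - 1)} } \sum_{i : \Delta_i > 0} \sqrt{\mathbb{E} T_i(n)} \\
& \le \sqrt{ n K \Bigl( 4 \alpha \ln n + \frac{3}{2} + \frac{1}{2(\alpha - 1)} \Bigr) } ,
\end{align*}

where Jensen's inequality guaranteed that $\sum_{i : \Delta_i > 0} \sqrt{\mathbb{E} T_i(n)} \le \sqrt{n K}$.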



Remark 4 (UCB($\alpha$) and EBA). We can rephrase the results of [14] as using
UCB1 as an allocation strategy and forming a recommendation according to
the empirical best arm. In particular, [14, Theorem 5] provides a
distribution-dependent bound on the probability of not picking the best arm with this
procedure and can be used to derive the following bound on the simple regret of
UCB($\alpha$) combined with EBA: for all $n \ge 1$,



i:Ai>0 V 



where $\rho_\alpha$ is a positive constant depending on $\alpha$ only. The leading constants
$1/\Delta_i$ and the distribution-dependent exponent make it not as useful as the one
presented in Theorem 2. The best distribution-free bound we could get from
this bound was of the order of $1/\sqrt{\rho_\alpha \ln n}$, to be compared to the asymptotic
optimal $1/\sqrt{n}$ rate stated in Theorem 3.



5. Conclusions for the case of finitely many arms: Comparison of the 
bounds, simulation study 

We first explain why, in some cases, the bound provided by our theoretical
analysis in Lemma 1 (for UCB($\alpha$) and MPA) is better than the bound stated in
Proposition 1 (for Unif and EBA). The central point in the argument is that the
bound of Lemma 1 is of the form $\bigcirc \, n^{2(1 - \alpha)}$, for some distribution-dependent
constant $\bigcirc$, that is, it has a distribution-free convergence rate. In comparison,
the bound of Proposition 1 involves the gaps $\Delta_i$ in the rate of convergence. Some
care is needed in the comparison, since the bound for UCB($\alpha$) holds only for $n$
large enough, but it is easy to find situations where for moderate values of $n$, the
bound exhibited for the sampling with UCB($\alpha$) is better than the one for the
uniform allocation. These situations typically involve a rather large number $K$
of arms; in the latter case, the uniform allocation strategy only samples each arm
$\lfloor n/K \rfloor$ times, whereas the UCB strategy rapidly focuses its exploration on
the best arms. A general argument is proposed in Appendix A.2 as
well as a numerical example, showing that for moderate values of $n$, the bounds
associated with the sampling with UCB($\alpha$) are better than the ones associated
with the uniform sampling. This is further illustrated numerically, in the right 
part of Figure H]) . 

To summarize the longer story told in this paper, one can distinguish
three regimes, according to the value of the number of rounds $n$. The statements
of these regimes (the ranges of their corresponding $n$) involve
distribution-dependent quantifications, to determine which $n$ are considered small, moderate,
or large.

• For large values of $n$, uniform exploration is better (as shown by a
combination of the lower bound of Corollary 2 and of the upper bound of
Proposition 1).

• For moderate values of $n$, sampling with UCB($\alpha$) is preferable, as discussed
just above (and in Appendix A.2).

• For small values of n, little can be said and the best bounds to consider 
are perhaps the distribution-free bounds, which are of the same order of 
magnitude for the two pairs of strategies. 



We propose two simple experiments to illustrate our theoretical analysis;
each of them was run on $10^4$ instances of the problem and we plotted the
average simple regret. This is an instance of the Monte-Carlo method and
provides accurate estimators of the expected simple regret $\mathbb{E} r_n$.

The first experiment (upper plot of Figure 4) shows that for small values of $n$
(here, $n \le 80$), the uniform allocation strategy can have an interesting behavior.
Of course the range of these "small" values of $n$ can be made arbitrarily large
by decreasing the gap $\Delta$. The second one (lower plot of Figure 4) corresponds
to the numerical example to be described in Appendix A.2. In both
cases, the unclear picture for small values of $n$ becomes clearer for moderate
values and shows an advantage in favor of UCB-based allocation strategies. It
also appears (here and in other non-reported experiments) that it is better in
practice to use recommendations based on the empirical best arm rather than
on the most played arm. In particular, the theoretical upper bounds indicated
in this paper for the combination of UCB as an allocation strategy and the
recommendation based on the empirical best arm (see Remark 4) are probably
to be improved.

Remark 5. We mostly illustrated here the small and moderate $n$ regimes. This
is because for large $n$, the simple regret is usually very small, even below
computer precision. Therefore, because of the chosen ranges, we do not see yet the
uniform allocation strategy getting better than UCB-based strategies, a fact
that is true however for large enough $n$. This has an important impact on the
interpretation of the lower bound of Theorem 1. While its statement is in finite
time, it should be interpreted as providing an asymptotic result only.

6. Pure exploration for continuous-armed bandits

This section is of theoretical interest. We consider the $\mathcal{X}$-armed bandit
problem already studied, e.g., in [12], and (re)define the notions of
cumulative and simple regret in this setting. We show that the cumulative regret
can be minimized if and only if the simple regret can be minimized, and use
this equivalence to characterize the metric spaces $\mathcal{X}$ in which the cumulative
regret can be minimized: the separable ones. Here, in addition to its natural
interpretation, the simple regret thus appears as a tool for proving results on
the cumulative regret.

6.1. Description of the model of $\mathcal{X}$-armed bandits

We consider a bounded interval of $\mathbb{R}$, say $[0, 1]$ again. We denote by $\mathcal{P}([0, 1])$
the set of probability distributions over $[0, 1]$. Similarly, given a topological
space $\mathcal{X}$, we denote by $\mathcal{P}(\mathcal{X})$ the set of probability distributions over $\mathcal{X}$. We
then call environment on $\mathcal{X}$ any mapping $E : \mathcal{X} \to \mathcal{P}([0, 1])$. We say that $E$ is
continuous if the mapping that associates to each $x \in \mathcal{X}$ the expectation $\mu(x)$
of $E(x)$ is continuous; we call the latter the mean-payoff function.

The $\mathcal{X}$-armed bandit problem is described in Figures 5 and 6. There, an
environment $E$ on $\mathcal{X}$ is fixed and we want various notions of regret to be small,
given this environment.

We consider now families of environments and say that a family $\mathcal{F}$ of
environments is explorable-exploitable (respectively, explorable) if there exists a
forecaster such that for any environment $E \in \mathcal{F}$, the expected cumulative regret
$\mathbb{E} R_n$ (expectation taken with respect to $E$ and all auxiliary randomizations) is
$o(n)$ (respectively, $\mathbb{E} r_n = o(1)$). Of course, explorability of $\mathcal{F}$ is a milder
requirement than explorability-exploitability of $\mathcal{F}$, as can be seen by considering
the recommendation given by the empirical distribution of plays of Figure 3 and
applying the same argument as the one used at the beginning of Section 3.

In fact, it can be seen that the two notions are equivalent, and this is why
we will henceforth concentrate on explorability only, for which characterizations
as the ones of Theorem 4 are simpler to exhibit and prove.
Figure 4: $K = 20$ arms with Bernoulli distributions of parameters indicated on top of each
graph. x-axis: number of rounds $n$; y-axis: simple regrets $\mathbb{E} r_n$ (estimated by a Monte-Carlo
method).
Parameters: an environment $E : \mathcal{X} \to \mathcal{P}([0, 1])$
For each round $t = 1, 2, \ldots$,

(1) the forecaster chooses a distribution $\varphi_t \in \mathcal{P}(\mathcal{X})$ and pulls an arm $I_t$
at random according to $\varphi_t$;

(2) the environment draws the reward $Y_t$ for that action, according to
$E(I_t)$.

Goal:

Find an allocation strategy $(\varphi_t)$ such that the cumulative regret

\[
R_n = n \sup_{x \in \mathcal{X}} \mu(x) - \sum_{t=1}^{n} \mu(I_t)
\]

is small (i.e., $o(n)$, in expectation).

Figure 5: The classical $\mathcal{X}$-armed bandit problem.

Lemma 2. A family of environments $\mathcal{F}$ is explorable if and only if it is
explorable-exploitable.

The proof can be found in Section 6.3. It relies essentially on designing a
strategy suited for cumulative regret from a strategy minimizing the simple
regret; to do so, exploration and exploitation occur at fixed rounds in two distinct
phases and only the payoffs obtained during exploration rounds are fed into the
base allocation strategy.

6.2. A positive result for metric spaces

We denote by $\mathcal{P}([0, 1])^{\mathcal{X}}$ the family of all possible environments $E$ on $\mathcal{X}$,
and by $\mathcal{C}\bigl( \mathcal{P}([0, 1])^{\mathcal{X}} \bigr)$ the subset of $\mathcal{P}([0, 1])^{\mathcal{X}}$ formed by the continuous
environments.

Example 1. Previous sections were about the family $\mathcal{P}([0, 1])^{\mathcal{X}}$ of all
environments over $\mathcal{X} = \{1, \ldots, K\}$ being explorable.

The main result concerning $\mathcal{X}$-armed bandit problems is formed by the
following equivalences in metric spaces. It generalizes the result of Example 1.

Theorem 4. Let $\mathcal{X}$ be a metric space. Then the family $\mathcal{C}\bigl( \mathcal{P}([0, 1])^{\mathcal{X}} \bigr)$ is
explorable if and only if $\mathcal{X}$ is separable.

Corollary 4. Let $\mathcal{X}$ be a set. The family $\mathcal{P}([0, 1])^{\mathcal{X}}$ is explorable if and only if
$\mathcal{X}$ is countable.
Parameters: an environment $E : \mathcal{X} \to \mathcal{P}([0, 1])$
For each round $t = 1, 2, \ldots$,

(1) the forecaster chooses a distribution $\varphi_t \in \mathcal{P}(\mathcal{X})$ and pulls an arm $I_t$
at random according to $\varphi_t$;

(2) the environment draws the reward $Y_t$ for that action, according to
$E(I_t)$;

(3) the forecaster outputs a recommendation $\psi_t \in \mathcal{P}(\mathcal{X})$;

(4) if the environment sends a stopping signal, then the game ends;
otherwise, the next round starts.

Goal:

Find an allocation strategy $(\varphi_t)$ and a recommendation strategy $(\psi_t)$ such
that the simple regret

\[
r_n = \sup_{x \in \mathcal{X}} \mu(x) - \int_{\mathcal{X}} \mu(x) \, \mathrm{d}\psi_n(x)
\]

is small (i.e., $o(1)$, in expectation).

Figure 6: The pure exploration problem for $\mathcal{X}$-armed bandits.



The proofs can be found in Section 6.4. Their main technical ingredient
is that there exists a probability distribution over a metric space $\mathcal{X}$ giving a
positive probability mass to all open sets if and only if $\mathcal{X}$ is separable. Whenever
such a distribution exists, it allows some uniform exploration.



Remark 6. We discuss here the links with results reported recently in [13]. The
latter restricts its attention to a setting where the space $\mathcal{X}$ is a metric space
(with metric denoted by $d$) and where the environments must have mean-payoff
functions that are 1-Lipschitz with respect to $d$. Its main concern is about
the best achievable order of magnitude of the cumulative regret with respect
to $T$. In this respect, its main result is that a distribution-dependent bound
proportional to $\log(T)$ can be achieved if and only if the completion of $\mathcal{X}$ is a
compact metric space with countably many points. Otherwise, bounds on the
regret are proportional to at least $\sqrt{T}$. In fact, the links between our work and
this article are not in the statements of the results proved but rather in the
techniques used in the proofs.

6.3. Proof of Lemma 2

Proof. In view of the comments before the statement of Lemma 2, we need
only to prove that an explorable family $\mathcal{F}$ is also explorable-exploitable. We
consider a pair of allocation $(\varphi_t)$ and recommendation $(\psi_t)$ strategies such that
for all environments $E \in \mathcal{F}$, the simple regret satisfies $\mathbb{E} r_n = o(1)$, and provide
a new strategy $(\varphi'_t)$ such that its cumulative regret satisfies $\mathbb{E} R'_n = o(n)$ for all
environments $E \in \mathcal{F}$.

It is defined informally as follows. At round $t = 1$, it uses $\varphi'_1 = \varphi_1$ and gets
a reward $Y_1$. Based on this reward, the recommendation $\psi_1(Y_1)$ is formed and
at round $t = 2$, the new strategy plays $\varphi'_2(Y_1) = \psi_1(Y_1)$. It gets a reward $Y_2$ but
does not take it into account. It bases its choice $\varphi'_3(Y_1, Y_2) = \varphi_2(Y_1)$ only on
$Y_1$ and gets a reward $Y_3$. Based on $Y_1$ and $Y_3$, the recommendation $\psi_2(Y_1, Y_3)$
is formed and played at rounds $t = 4$ and $t = 5$, i.e.,

\[
\varphi'_4(Y_1, Y_2, Y_3) = \varphi'_5(Y_1, Y_2, Y_3, Y_4) = \psi_2(Y_1, Y_3) .
\]

And so on: the sequence of distributions chosen by the new strategy is formed
using the applications

\[
\begin{array}{l}
\varphi_1, \ \psi_1, \\
\varphi_2, \ \psi_2, \ \psi_2, \\
\varphi_3, \ \psi_3, \ \psi_3, \ \psi_3, \\
\varphi_4, \ \psi_4, \ \psi_4, \ \psi_4, \ \psi_4, \\
\varphi_5, \ \psi_5, \ \psi_5, \ \psi_5, \ \psi_5, \ \psi_5, \\
\ldots
\end{array}
\]

Formally, we consider regimes indexed by integers $t \ge 1$ and of length $1 + t$.
The $t$-th regime starts at round

\[
1 + \sum_{s=1}^{t-1} (1 + s) = t + \frac{t(t - 1)}{2} = \frac{t(t + 1)}{2} .
\]

During this regime, the following distributions are used,

\[
\varphi'_{t(t+1)/2 + k} =
\begin{cases}
\varphi_t\bigl( ( Y_{s(s+1)/2} )_{s = 1, \ldots, t-1} \bigr) & \text{if } k = 0 ; \\
\psi_t\bigl( ( Y_{s(s+1)/2} )_{s = 1, \ldots, t} \bigr) & \text{if } 1 \le k \le t .
\end{cases}
\]

Note that we only keep track of the payoffs obtained when $k = 0$ in a regime.

The regret $R'_n$ at round $n$ of this strategy is as follows. We decompose $n$ in
a unique manner as

\[
n = \frac{t(n) \bigl( t(n) + 1 \bigr)}{2} + k(n) \qquad \text{where} \quad k(n) \in \{ 0, \ldots, t(n) \} . \tag{5}
\]

Then (using also the tower rule),

\[
\mathbb{E} R'_n \le t(n) + \Bigl( \mathbb{E} r_1 + 2 \, \mathbb{E} r_2 + \ldots + \bigl( t(n) - 1 \bigr) \mathbb{E} r_{t(n) - 1} + k(n) \, \mathbb{E} r_{t(n)} \Bigr)
\]

where the first term comes from the rounds when the new strategy used
the base allocation strategy to explore and where the other terms come from
the ones when it exploited. This inequality can be rewritten as

\[
\frac{\mathbb{E} R'_n}{n} \le \frac{t(n)}{n}
+ \frac{ k(n) \, \mathbb{E} r_{t(n)} + \sum_{s=1}^{t(n) - 1} s \, \mathbb{E} r_s }{n} ,
\]

which shows that $\mathbb{E} R'_n = o(n)$ whenever $\mathbb{E} r_s = o(1)$ as $s \to \infty$, since the
first term in the right-hand side is of the order of $1/\sqrt{n}$ and the second one
is a Cesàro average. This concludes that the exhibited strategy has a small
cumulative regret for all environments of the family, which is thus
explorable-exploitable.
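The decomposition of a round $n$ as $n = t(t+1)/2 + k$ with $0 \le k \le t$, on which the construction relies, can be computed explicitly. The following sketch (with function names of our own) returns the regime index and the position within the regime, and labels rounds as exploration rounds ($k = 0$) or exploitation rounds ($1 \le k \le t$).

```python
import math


def decompose(n):
    """Return (t, k) with n = t(t+1)/2 + k and 0 <= k <= t."""
    if n < 1:
        raise ValueError("rounds are indexed from 1")
    t = (math.isqrt(8 * n + 1) - 1) // 2  # largest t with t(t+1)/2 <= n
    return t, n - t * (t + 1) // 2


def schedule(n):
    """Label the first n rounds: 'explore' when k = 0, 'exploit' otherwise."""
    return ["explore" if decompose(r)[1] == 0 else "exploit" for r in range(1, n + 1)]


print([decompose(r) for r in range(1, 10)])
print(schedule(9))
```

Among the first $n$ rounds, only about $\sqrt{2n}$ are exploration rounds, which is why the first term of the bound on $\mathbb{E} R'_n / n$ vanishes.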

6.4. Proof of Theorem 4 and its corollary

The key ingredient is the following characterization of separability (which 
relies on an application of Zorn's lemma); see, e.g., [H, Appendix I, page 216]. 

Lemma 3. A metric space $\mathcal{X}$, with distance denoted by $d$, is separable if and
only if it contains no uncountable subset $A$ such that

\[
\rho = \inf \bigl\{ d(x, y) : x, y \in A, \ x \ne y \bigr\} > 0 .
\]

Separability can then be characterized in terms of the existence of a
probability distribution with full support. Though it seems natural, we did not see
any reference to it in the literature and this is why we state it. (In the proof of
Theorem 4, we will only use the straightforward direct part of the
characterization.)

Lemma 4. Let $\mathcal{X}$ be a metric space. There exists a probability distribution $\lambda$
on $\mathcal{X}$ with $\lambda(V) > 0$ for all open sets $V$ if and only if $\mathcal{X}$ is separable.

Proof. We prove the converse implication first. If $\mathcal{X}$ is separable, we denote
by $x_1, x_2, \ldots$ a dense sequence. If it is finite with length $N$, we let

\[
\lambda = \frac{1}{N} \sum_{i=1}^{N} \delta_{x_i} ,
\]

and otherwise,

\[
\lambda = \sum_{i \ge 1} \frac{1}{2^i} \, \delta_{x_i} .
\]

The result follows, since each open set $V$ contains at least some $x_j$.

For the direct implication, we use Lemma 3 (and its notations). If $\mathcal{X}$ is
not separable, then it contains uncountably many disjoint open balls, formed
by the $B(a, \rho/2)$, for $a \in A$. If there existed a probability distribution $\lambda$ with
full support on $\mathcal{X}$, it would in particular give a positive probability to all these
balls; but this is impossible, since there are uncountably many of them.
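To illustrate how a dense sequence yields a distribution charging every open set, consider $\mathcal{X} = [0, 1]$ with the dyadic rationals as a dense sequence $x_1, x_2, \ldots$ and, as one possible choice of weights (an assumption made for this example), $\lambda = \sum_{i \ge 1} 2^{-i} \delta_{x_i}$. The sketch below returns, for an open interval $(a, b)$, the weight of the first point of the sequence that falls in it, which is a positive lower bound on $\lambda\bigl((a, b)\bigr)$. The enumeration order and function names are ours.

```python
from fractions import Fraction


def dyadic_sequence():
    """Enumerate a dense sequence of [0, 1]: 0, 1, 1/2, 1/4, 3/4, 1/8, ..."""
    yield Fraction(0)
    yield Fraction(1)
    level = 1
    while True:
        denom = 2 ** level
        for num in range(1, denom, 2):
            yield Fraction(num, denom)
        level += 1


def mass_lower_bound(a, b):
    """Weight 2^{-i} of the first x_i in the open interval (a, b), for 0 <= a < b <= 1."""
    if not (0 <= a < b <= 1):
        raise ValueError("expected 0 <= a < b <= 1")
    for i, x in enumerate(dyadic_sequence(), start=1):
        if a < x < b:
            return Fraction(1, 2 ** i)


print(mass_lower_bound(Fraction(3, 10), Fraction(31, 100)))
```

The search always terminates because the dyadic rationals are dense in $[0, 1]$; the returned mass can be tiny, but it is never zero.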
6.4.1. Separability of $\mathcal{X}$ implies explorability of the family $\mathcal{C}\bigl( \mathcal{P}([0, 1])^{\mathcal{X}} \bigr)$

The proof of the converse part of the characterization provided by Theorem 4
relies on a somewhat uniform exploration that hits each open set of $\mathcal{X}$ after a
random waiting time with distribution depending on the probability of the open
set.

Proof. Since $\mathcal{X}$ is separable, there exists a probability distribution $\lambda$ on $\mathcal{X}$
with $\lambda(V) > 0$ for all open sets $V$, as asserted by Lemma 4.

The proposed strategy is then constructed in a way similar to the one
exhibited in Appendix A.2, in the sense that we also consider successive
regimes, where the $t$-th of them has also length $1 + t$. They use the following
allocations,

\[
\varphi_{t(t+1)/2 + k} =
\begin{cases}
\lambda & \text{if } k = 0 ; \\
\delta_{I_{k(k+1)/2}} & \text{if } 1 \le k \le t .
\end{cases}
\]

Put in words, at the beginning of each regime, a new point $I_{t(t+1)/2}$ is drawn
at random in $\mathcal{X}$ according to $\lambda$, and then, all previously drawn points $I_{s(s+1)/2}$,
for $1 \le s \le t - 1$, and the new point $I_{t(t+1)/2}$ are pulled again, one after the
other.
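The bookkeeping of this allocation can be sketched as follows: at round $n = t(t+1)/2 + k$, the strategy draws a fresh point (indexed by $t$) when $k = 0$ and replays the point indexed by $k$ otherwise. The code below counts how many times each drawn point has been pulled after $n$ rounds and checks the lower bound $t(n) - s + 1$ on the number of pulls of the $s$-th drawn point. It assumes, for simplicity, that all drawn points are distinct; function names are ours.

```python
def decompose(n):
    """Return (t, k) with n = t(t+1)/2 + k and 0 <= k <= t."""
    t = 1
    while (t + 1) * (t + 2) // 2 <= n:
        t += 1
    return t, n - t * (t + 1) // 2


def pulled_point(n):
    """Index s of the point pulled at round n (a fresh draw when k = 0)."""
    t, k = decompose(n)
    return t if k == 0 else k


def pull_counts(n):
    """Number of pulls of each drawn point after n rounds, keyed by its index s."""
    counts = {}
    for r in range(1, n + 1):
        s = pulled_point(r)
        counts[s] = counts.get(s, 0) + 1
    return counts


n = 100
t_n, _ = decompose(n)
counts = pull_counts(n)
for s, c in sorted(counts.items()):
    assert c >= t_n - s + 1  # each point is replayed in every later regime
print(t_n, counts)
```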

The recommendations $\psi_n$ are deterministic and put all probability mass on
the best empirical arm among the first $g(n)$ played arms (where the function $g$
will be determined by the analysis). Formally, for all $x \in \mathcal{X}$ such that

\[
T_n(x) = \sum_{t=1}^{n} \mathbb{1}_{\{ I_t = x \}} \ge 1 ,
\]

one defines

\[
\widehat{\mu}_n(x) = \frac{1}{T_n(x)} \sum_{t=1}^{n} Y_t \, \mathbb{1}_{\{ I_t = x \}} .
\]

Then,

\[
\psi_n = \delta_{X^*_n} \qquad \text{where} \qquad X^*_n \in \mathop{\mathrm{argmax}}_{1 \le s \le g(n)} \widehat{\mu}_n\bigl( I_{s(s+1)/2} \bigr)
\]

(ties broken in some way, as usual; and $g(n)$ to be chosen small enough so that
all considered arms have been played at least once). Note that exploration and
exploitation appear in two distinct phases, as was the case already, for instance,
in Section 6.3.
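We now denote

\[
\mu^* = \sup_{x \in \mathcal{X}} \mu(x) \qquad \text{and} \qquad
\mu^*_{g(n)} = \max_{1 \le s \le g(n)} \mu\bigl( I_{s(s+1)/2} \bigr) ;
\]

the simple regret can then be decomposed as

\[
\mathbb{E} r_n = \mu^* - \mathbb{E}\bigl[ \mu(X^*_n) \bigr]
= \Bigl( \mu^* - \mathbb{E}\bigl[ \mu^*_{g(n)} \bigr] \Bigr)
+ \Bigl( \mathbb{E}\bigl[ \mu^*_{g(n)} \bigr] - \mathbb{E}\bigl[ \mu(X^*_n) \bigr] \Bigr) ,
\]

where the first difference can be thought of as an approximation error, and the
second one, as resulting from an estimation error. We now show that both
differences vanish in the limit.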

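We first deal with the approximation error. We fix $\varepsilon > 0$. Since the
mean-payoff function $\mu$ is continuous on $\mathcal{X}$, there exists an open set $V$ such that

\[
\forall x \in V, \qquad \mu^* - \mu(x) \le \varepsilon .
\]

It follows that

\[
\mathbb{P}\bigl\{ \mu^* - \mu^*_{g(n)} > \varepsilon \bigr\}
\le \mathbb{P}\bigl\{ \forall s \in \{ 1, \ldots, g(n) \}, \ I_{s(s+1)/2} \notin V \bigr\}
\le \bigl( 1 - \lambda(V) \bigr)^{g(n)} \longrightarrow 0 ,
\]

provided that $g(n) \to \infty$ (a condition that will be satisfied, see below). Since
in addition, $\mu^*_{g(n)} \le \mu^*$, we get

\[
\limsup_{n \to \infty} \ \mu^* - \mathbb{E}\bigl[ \mu^*_{g(n)} \bigr] \le \varepsilon .
\]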

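For the difference resulting from the estimation error, we denote

\[
I^*_n \in \mathop{\mathrm{argmax}}_{1 \le s \le g(n)} \mu\bigl( I_{s(s+1)/2} \bigr)
\]

(ties broken in some way). Fix an arbitrary $\varepsilon > 0$. We note that if for all
$1 \le s \le g(n)$,

\[
\bigl| \widehat{\mu}_n\bigl( I_{s(s+1)/2} \bigr) - \mu\bigl( I_{s(s+1)/2} \bigr) \bigr| \le \varepsilon ,
\]

then (together with the definition of $X^*_n$)

\[
\mu(X^*_n) \ge \widehat{\mu}_n(X^*_n) - \varepsilon \ge \widehat{\mu}_n(I^*_n) - \varepsilon \ge \mu(I^*_n) - 2 \varepsilon = \mu^*_{g(n)} - 2 \varepsilon .
\]

Thus, we have proved the inequality

\[
\mathbb{E}\bigl[ \mu^*_{g(n)} \bigr] - \mathbb{E}\bigl[ \mu(X^*_n) \bigr]
\le 2 \varepsilon + \mathbb{P}\Bigl\{ \exists s \le g(n) : \ \bigl| \widehat{\mu}_n\bigl( I_{s(s+1)/2} \bigr) - \mu\bigl( I_{s(s+1)/2} \bigr) \bigr| > \varepsilon \Bigr\} . \tag{6}
\]

We use a union bound and control each (conditional) probability

\[
\mathbb{P}\Bigl\{ \bigl| \widehat{\mu}_n\bigl( I_{s(s+1)/2} \bigr) - \mu\bigl( I_{s(s+1)/2} \bigr) \bigr| > \varepsilon \ \Big| \ \mathcal{A}_n \Bigr\} \tag{7}
\]

for $1 \le s \le g(n)$, where $\mathcal{A}_n$ is the $\sigma$-algebra generated by the randomly drawn
points $I_{k(k+1)/2}$, for those $k$ with $k(k + 1)/2 \le n$. Conditionally to them,
$\widehat{\mu}_n\bigl( I_{s(s+1)/2} \bigr)$ is an average of a deterministic number of summands, which only
depends on $s$, and thus, classical concentration-of-the-measure arguments can
be used. For instance, the quantities (7) are bounded, via an application of
Hoeffding's inequality [111 ], by

\[
2 \exp\Bigl( - 2 \, T_n\bigl( I_{s(s+1)/2} \bigr) \, \varepsilon^2 \Bigr) .
\]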
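We lower bound $T_n\bigl( I_{s(s+1)/2} \bigr)$. The point $I_{s(s+1)/2}$ was pulled twice in regime $s$,
once in each regime $s + 1, \ldots, t(n) - 1$, and maybe in $t(n)$, where $n$ is decomposed
again as in (5). That is,

\[
T_n\bigl( I_{s(s+1)/2} \bigr) \ge t(n) - s + 1 \ge \sqrt{2n} - 1 - g(n) ,
\]

since we only consider $s \le g(n)$ and since (5) implies that

\[
n \le \frac{t(n) \bigl( t(n) + 3 \bigr)}{2} \le \frac{\bigl( t(n) + 2 \bigr)^2}{2} ,
\qquad \text{that is,} \qquad t(n) \ge \sqrt{2n} - 2 .
\]

Substituting this in Hoeffding's bound, integrating, and taking a union
bound lead from (6) to

\[
\mathbb{E}\bigl[ \mu^*_{g(n)} \bigr] - \mathbb{E}\bigl[ \mu(X^*_n) \bigr]
\le 2 \varepsilon + 2 g(n) \exp\Bigl( - 2 \bigl( \sqrt{2n} - 1 - g(n) \bigr) \varepsilon^2 \Bigr) .
\]

Choosing for instance $g(n) = \sqrt{n}/2$ ensures that

\[
\limsup_{n \to \infty} \ \mathbb{E}\bigl[ \mu^*_{g(n)} \bigr] - \mathbb{E}\bigl[ \mu(X^*_n) \bigr] \le 2 \varepsilon .
\]

Summing up the two superior limits, we finally get

\[
\limsup_{n \to \infty} \mathbb{E} r_n
\le \limsup_{n \to \infty} \Bigl( \mu^* - \mathbb{E}\bigl[ \mu^*_{g(n)} \bigr] \Bigr)
+ \limsup_{n \to \infty} \Bigl( \mathbb{E}\bigl[ \mu^*_{g(n)} \bigr] - \mathbb{E}\bigl[ \mu(X^*_n) \bigr] \Bigr)
\le 3 \varepsilon ;
\]

since this is true for all arbitrary $\varepsilon > 0$, the proof is concluded.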
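The role of the choice $g(n) = \sqrt{n}/2$ can be illustrated numerically: the term $2 g(n) \exp\bigl( -2 ( \sqrt{2n} - 1 - g(n) ) \varepsilon^2 \bigr)$ vanishes as $n$ grows, for any fixed $\varepsilon > 0$, even though it may be large for moderate $n$. The sketch below (function name and numerical values are ours) evaluates it for a few horizons.

```python
import math


def estimation_term(n, eps):
    """2 g(n) exp(-2 (sqrt(2n) - 1 - g(n)) eps^2) with g(n) = sqrt(n) / 2."""
    g = math.sqrt(n) / 2
    return 2 * g * math.exp(-2 * (math.sqrt(2 * n) - 1 - g) * eps**2)


for n in (10**4, 10**6, 10**8):
    print(n, estimation_term(n, eps=0.1))
```

The convergence is slow when $\varepsilon$ is small, since the exponent only grows like $\sqrt{n} \, \varepsilon^2$; this is consistent with the purely asymptotic nature of the statement.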



6.4.2. Explorability of the family $\mathcal{C}\bigl( \mathcal{P}([0, 1])^{\mathcal{X}} \bigr)$ implies separability of $\mathcal{X}$

We now prove the direct part of the characterization provided by Theorem 4.
It basically follows from the impossibility of a uniform exploration, as asserted
by Lemma 4.

Proof. Let $\mathcal{X}$ be a non-separable metric space with metric denoted by $d$. Let
$A$ be an uncountable subset of $\mathcal{X}$ for which $\rho > 0$, with $\rho$ defined as in
Lemma 3 (such a subset exists since $\mathcal{X}$ is not separable); in particular, the balls
$B(a, \rho/2)$ are disjoint, for $a \in A$.

We now consider the subset of $\mathcal{C}\bigl( \mathcal{P}([0, 1])^{\mathcal{X}} \bigr)$ formed by the environments
$E_a$ defined as follows. They are indexed by $a \in A$ and their corresponding
mean-payoff functions are given by

\[
\mu_a : x \in \mathcal{X} \longmapsto \Bigl( 1 - \frac{d(x, a)}{\rho/2} \Bigr)^{+} .
\]

The associated environments $E_a$ are deterministic, in the sense that they are
defined as $E_a(x) = \delta_{\mu_a(x)}$. Note that each $\mu_a$ is continuous, that $\mu_a(x) > 0$ for
all $x \in B(a, \rho/2)$ but $\mu_a(x) = 0$ for all $x \in \mathcal{X} \setminus B(a, \rho/2)$; that the best arm
under $E_a$ is $a$ and that it gets a reward equal to $\mu^*_a = \mu_a(a) = 1$.
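The shape of these mean-payoff functions, $x \mapsto \bigl( 1 - d(x, a)/(\rho/2) \bigr)^+$, can be sketched in a few lines. The example below uses two points of the real line at distance $\rho = 1$ purely for illustration (the real line is of course separable; only the shape of the functions and the disjointness of their supports are shown). Function names and values are ours.

```python
def mean_payoff(a, rho, dist):
    """Return the function x -> max(0, 1 - dist(x, a) / (rho / 2))."""
    return lambda x: max(0.0, 1.0 - dist(x, a) / (rho / 2))


def dist(x, y):
    """Usual distance on the real line (illustrative choice of metric)."""
    return abs(x - y)


mu_0 = mean_payoff(0.0, rho=1.0, dist=dist)
mu_1 = mean_payoff(1.0, rho=1.0, dist=dist)

# mu_0 peaks at 0, vanishes outside the ball of radius rho/2 around 0,
# and mu_1 is null on that ball.
print(mu_0(0.0), mu_0(0.25), mu_0(0.5), mu_1(0.25))
```

A forecaster that never samples inside the ball around $a$ therefore only observes null rewards under the corresponding environment, which is the mechanism exploited in the rest of the proof.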
We fix a forecaster and denote by $\mathbb{E}_a$ the expectation under environment $E_a$
with respect to the auxiliary randomizations used by the forecaster. Since $\mu_a$
vanishes outside $B(a, \rho/2)$ and has a maximum equal to 1,

\[
\mathbb{E}_a r_n = 1 - \mathbb{E}_a\Bigl[ \int_{\mathcal{X}} \mu_a(x) \, \mathrm{d}\psi_n(x) \Bigr]
\ge 1 - \mathbb{E}_a\Bigl[ \psi_n\bigl( B(a, \rho/2) \bigr) \Bigr] .
\]

We now show the existence of a non-empty set $A'$ such that for all $a \in A'$ and
$n \ge 1$,

\[
\mathbb{E}_a\Bigl[ \psi_n\bigl( B(a, \rho/2) \bigr) \Bigr] = 0 ; \tag{8}
\]

this indicates that $\mathbb{E}_a r_n = 1$ for all $n \ge 1$ and $a \in A'$, thus preventing in
particular $\mathcal{C}\bigl( \mathcal{P}([0, 1])^{\mathcal{X}} \bigr)$ from being explorable by the fixed forecaster.

The set $A'$ is constructed by studying the behavior of the forecaster under
the environment $E_0$ yielding deterministic null rewards throughout the space,
i.e., associated with the mean-payoff function $x \in \mathcal{X} \mapsto \mu_0(x) = 0$. In the first
round, the forecaster chooses a deterministic distribution $\varphi_1 = \varphi^0_1$ over $\mathcal{X}$, picks
$I_1$ at random according to $\varphi^0_1$, gets a deterministic payoff $Y_1 = 0$, and finally
recommends $\psi^0_1(I_1) = \psi_1(I_1, Y_1)$ (which depends on $I_1$ only, since the obtained
payoffs are all null in a deterministic way). In the second round, it chooses
an allocation $\varphi^0_2(I_1)$ (that depends only on $I_1$, for the same reasons as before),
picks $I_2$ at random according to $\varphi^0_2(I_1)$, gets a null reward, and recommends
$\psi^0_2(I_1, I_2)$; and so on.

We denote by $\mathbb{A}$ the probability distribution giving the auxiliary randomizations
used to draw the $I_t$ at random, and for all integers $t$ and all measurable
applications

\[
\nu : (x_1, \ldots, x_t) \in \mathcal{X}^t \longmapsto \nu(x_1, \ldots, x_t) \in \mathcal{P}(\mathcal{X})
\]

we introduce the distributions $\mathbb{A} \cdot \nu \in \mathcal{P}(\mathcal{X})$ defined as the following mixture of
distributions. For all measurable sets $V \subset \mathcal{X}$,

\[
\mathbb{A} \cdot \nu (V) = \mathbb{E}_{\mathbb{A}}\Bigl[ \int_{V} \mathrm{d}\nu(I_1, \ldots, I_t) \Bigr] .
\]

A probability distribution can only put a positive mass on an at most countable
number of disjoint sets. Therefore, let $B_n$ and $C_n$ be defined as the at most
countable sets of $a$ such that, respectively, $\mathbb{A} \cdot \varphi^0_n$ and $\mathbb{A} \cdot \psi^0_n$ give a positive
probability mass to $B(a, \rho/2)$. Then, let

\[
A' = A \setminus \Bigl( \bigcup_{n \ge 1} B_n \ \cup \ \bigcup_{n \ge 1} C_n \Bigr)
\]

be the uncountable, thus non-empty, set of those elements of $A$ which are in no
$B_n$ or $C_n$.

By construction, for all $a \in A'$, the forecaster only gets null rewards; this
is because $a$ is in no $B_n$ and therefore, with probability 1, none of the $\varphi^0_n$ hits
$B(a, \rho/2)$, which is exactly the set of those elements of $\mathcal{X}$ for which $\mu_a > 0$. As
a consequence, the forecaster behaves similarly under the environments $E_a$ and
$E_0$, which means that for all measurable sets $V \subset \mathcal{X}$ and all $n \ge 1$,

\[
\mathbb{E}_a\bigl[ \varphi_n(V) \bigr] = \mathbb{A} \cdot \varphi^0_n (V) \qquad \text{and} \qquad
\mathbb{E}_a\bigl[ \psi_n(V) \bigr] = \mathbb{A} \cdot \psi^0_n (V) .
\]

In particular, since $a$ is in no $C_n$, no recommendation $\psi^0_n$ hits the ball
$B(a, \rho/2)$, which is exactly what remained to be proved, see (8).

6.4.3. The countable case of Corollary 4

We adopt an "à la Bourbaki" approach and derive this special case from the
general theory.

Proof. We endow $\mathcal{X}$ with the discrete topology, i.e., choose the distance

\[
d(x, y) = \mathbb{1}_{\{ x \ne y \}} .
\]

Then, all applications defined on $\mathcal{X}$ are continuous; in particular,

\[
\mathcal{C}\bigl( \mathcal{P}([0, 1])^{\mathcal{X}} \bigr) = \mathcal{P}([0, 1])^{\mathcal{X}} .
\]

In addition, $\mathcal{X}$ is then separable if and only if it is countable. The result thus
follows immediately from Theorem 4.

6.5. An additional remark about uniform bounds

In this paper, we mostly consider non-uniform bounds (bounds that are
individual as far as the environments are concerned). As for uniform bounds,
i.e., bounds on quantities of the form

\[
\sup_{E \in \mathcal{F}} \mathbb{E} R_n \qquad \text{or} \qquad \sup_{E \in \mathcal{F}} \mathbb{E} r_n
\]

for some family $\mathcal{F}$, two observations can be made.

First, it is easy to see that no sublinear uniform bound can be obtained for
the family of all continuous environments, as soon as there exist infinitely many
disjoint open balls.

However, one can exhibit such sublinear uniform bounds in some specific
scenarios; for instance, when $\mathcal{X}$ is totally bounded and $\mathcal{F}$ is formed by continuous
functions with a common bounded Lipschitz constant.

Appendix A. Appendix



Appendix A.1. Proof of the second statement of Proposition 1

We use below the notation introduced in the proof of the first statement of
Proposition 1.

Proof. Since some regret is suffered only when an arm with suboptimal
expectation has the best empirical performance,

$$\mathbb{E} r_n \leq \Bigl( \max_{i=1,\ldots,K} \Delta_i \Bigr) \, \mathbb{P} \Bigl\{ \max_{i : \Delta_i > 0} \hat{\mu}_{i,n} \geq \hat{\mu}_{i^*,n} \Bigr\} .$$

Now, the quantity of interest can be rewritten as

$$\max_{i : \Delta_i > 0} \hat{\mu}_{i,n} - \hat{\mu}_{i^*,n} = f\bigl( X_1, \ldots, X_{\lfloor n/K \rfloor} \bigr)$$

for some function $f$, where for all $s = 1, \ldots, \lfloor n/K \rfloor$, we denote by $X_s$ the vector
$(X_{1,s}, \ldots, X_{K,s})$. (The function $f$ is defined as a maximum of at most $K - 1$
sums of differences.) We apply the method of bounded differences, see [18], see
also [1, Chapter 2]. It is straightforward that, since all random variables of
interest take values in $[0,1]$, the bounded differences condition is satisfied with
ranges all equal to $2 / \lfloor n/K \rfloor$. Therefore, the indicated concentration inequality states
that



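$$\mathbb{P} \Bigl\{ \max_{i : \Delta_i > 0} \hat{\mu}_{i,n} - \hat{\mu}_{i^*,n} - \mathbb{E} \Bigl[ \max_{i : \Delta_i > 0} \hat{\mu}_{i,n} - \hat{\mu}_{i^*,n} \Bigr] \geq \varepsilon \Bigr\} \leq \exp \Bigl( - \frac{2 \lfloor n/K \rfloor \, \varepsilon^2}{4} \Bigr)$$

for all $\varepsilon > 0$. We choose

$$\varepsilon = - \mathbb{E} \Bigl[ \max_{i : \Delta_i > 0} \hat{\mu}_{i,n} - \hat{\mu}_{i^*,n} \Bigr] \geq \min_{i : \Delta_i > 0} \Delta_i - \mathbb{E} \Bigl[ \max_{i : \Delta_i > 0} \bigl\{ \hat{\mu}_{i,n} - \hat{\mu}_{i^*,n} + \Delta_i \bigr\} \Bigr]$$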



(where we used that the maximum of K first quantities plus the minimum of K 
other quantities is less than the maximum of the K sums). We now argue that 



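$$\mathbb{E} \Bigl[ \max_{i : \Delta_i > 0} \bigl\{ \hat{\mu}_{i,n} - \hat{\mu}_{i^*,n} + \Delta_i \bigr\} \Bigr] \leq \sqrt{\frac{\ln K}{\lfloor n/K \rfloor}} \, ;$$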



this is done by a classical argument, using bounds on the moment generating 
function of the random variables of interest. Consider 

$$Z_i = \lfloor n/K \rfloor \bigl( \hat{\mu}_{i,n} - \hat{\mu}_{i^*,n} + \Delta_i \bigr)$$

for all $i = 1, \ldots, K$; they correspond to centered sums of $2 \lfloor n/K \rfloor$ independent
random variables taking values in $[0,1]$ or $[-1,0]$. Hoeffding's lemma (see, e.g.,
[1, Chapter 2]) thus implies that for all $\lambda > 0$,

$$\mathbb{E} \bigl[ e^{\lambda Z_i} \bigr] \leq \exp \Bigl( \frac{1}{8} \lambda^2 \, 2 \lfloor n/K \rfloor \Bigr) = \exp \Bigl( \frac{1}{4} \lambda^2 \lfloor n/K \rfloor \Bigr) .$$

A well-known inequality for maxima of subgaussian random variables (see
[1, Chapter 2]) then yields



$$\mathbb{E} \Bigl[ \max_{i=1,\ldots,K} Z_i \Bigr] \leq \sqrt{\lfloor n/K \rfloor \ln K} \, ,$$



which leads to the claimed upper bound. Putting things together, we get that 
for the choice 



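$$\varepsilon = - \mathbb{E} \Bigl[ \max_{i : \Delta_i > 0} \hat{\mu}_{i,n} - \hat{\mu}_{i^*,n} \Bigr] \geq \min_{i : \Delta_i > 0} \Delta_i - \sqrt{\frac{\ln K}{\lfloor n/K \rfloor}} > 0$$

(for $n$ sufficiently large, a statement made precise below), we have

$$\mathbb{P} \Bigl\{ \max_{i : \Delta_i > 0} \hat{\mu}_{i,n} \geq \hat{\mu}_{i^*,n} \Bigr\} \leq \exp \Bigl( - \frac{2 \lfloor n/K \rfloor \, \varepsilon^2}{4} \Bigr) \leq \exp \Biggl( - \frac{1}{2} \lfloor n/K \rfloor \Bigl( \min_{i : \Delta_i > 0} \Delta_i - \sqrt{\frac{\ln K}{\lfloor n/K \rfloor}} \Bigr)^2 \Biggr) .$$

The result follows for $n$ such that

$$\min_{i : \Delta_i > 0} \Delta_i - \sqrt{\frac{\ln K}{\lfloor n/K \rfloor}} \geq (1 - \eta) \min_{i : \Delta_i > 0} \Delta_i \, ;$$

the second part of the statement of Proposition 1 indeed only considers such $n$.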
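
As an informal sanity check of the bound obtained in this proof, the following Python sketch simulates the uniform allocation with the empirical best arm recommendation and compares a Monte Carlo estimate of the simple regret with $(\max_i \Delta_i) \exp\bigl( -\tfrac{1}{2} \lfloor n/K \rfloor \varepsilon^2 \bigr)$, where $\varepsilon = \min_{i : \Delta_i > 0} \Delta_i - \sqrt{\ln K / \lfloor n/K \rfloor}$. Bernoulli rewards are an assumption made only for illustration (the proof covers any rewards in $[0,1]$), and the function names are ours.

```python
import math
import random


def simple_regret_uniform_eba(means, n, runs=2000, seed=0):
    """Monte Carlo estimate of E[r_n] for uniform allocation + empirical best arm."""
    rng = random.Random(seed)
    k = len(means)
    m = n // k  # each arm is pulled floor(n/K) times
    best = max(means)
    total = 0.0
    for _ in range(runs):
        # Empirical mean of m Bernoulli rewards for each arm.
        emp = [sum(rng.random() < mu for _ in range(m)) / m for mu in means]
        # Recommend an arm with the largest empirical mean (ties broken at random).
        top = max(emp)
        choice = rng.choice([i for i, e in enumerate(emp) if e == top])
        total += best - means[choice]
    return total / runs


def appendix_bound(means, n):
    """Return max_i Delta_i * exp(-(m/2) * eps^2), or None when eps <= 0."""
    k = len(means)
    m = n // k
    best = max(means)
    gaps = [best - mu for mu in means if mu < best]
    eps = min(gaps) - math.sqrt(math.log(k) / m)
    if eps <= 0:
        return None  # n is too small for the argument of the proof to apply
    return max(gaps) * math.exp(-0.5 * m * eps ** 2)
```

For instance, with `means = [0.7, 0.5, 0.4]` and `n = 300`, the estimate returned by `simple_regret_uniform_eba` can be compared with `appendix_bound(means, 300)`; for `n = 3` the condition $\varepsilon > 0$ fails and `appendix_bound` returns `None`.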

Appendix A.2. Detailed discussion of the heuristic arguments presented in Section^

We first state the following corollary to Lemma 1.

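Theorem 5. The allocation strategy given by UCB($\alpha$) (where $\alpha > 1$) associated
with the recommendation given by the most played arm ensures that

$$\mathbb{E} r_n \leq \frac{1}{\alpha - 1} \sum_{i : \Delta_i > 0} \Bigl( \frac{\beta n}{\Delta_i^2} - 1 \Bigr)^{2(1 - \alpha)}$$

for all $n$ sufficiently large, e.g., such that

$$\frac{n}{\ln n} \geq \frac{4\alpha + 1}{\beta} \qquad \text{and} \qquad n \geq \frac{K + 2}{\beta} \, (\Delta')^2 ,$$

where $\Delta' = \max_i \Delta_i$ and we denote by $K^*$ the number of optimal arms and

$$\frac{1}{\beta} = \frac{K^*}{\Delta^2} + \sum_{i : \Delta_i > 0} \frac{1}{\Delta_i^2} .$$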

Proof. We apply Lemma 1 with the choice $a_i = \beta / \Delta_i^2$ for all suboptimal arms
$i$ and $a_{i^*} = \beta / \Delta^2$ for all optimal arms $i^*$, where $\beta$ denotes the normalization
constant.
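
To make the choice of weights concrete, here is a small Python sketch that computes $\beta$ and the weights $a_i$ from a list of gaps, using exact rational arithmetic. It assumes that $\Delta$ denotes the smallest positive gap, as the worked example in this appendix suggests; the function name `ucb_weights` is ours.

```python
from fractions import Fraction


def ucb_weights(gaps):
    """Return (beta, weights) with a_i = beta / Delta_i^2 for suboptimal arms.

    Optimal arms (gap 0) receive beta / Delta^2, where Delta is taken to be
    the smallest positive gap (an assumption of this sketch).
    """
    positive = [g for g in gaps if g > 0]
    delta = min(positive)
    inv_sq = [1 / (g * g) if g > 0 else 1 / (delta * delta) for g in gaps]
    beta = 1 / sum(inv_sq)  # normalization so that the weights sum to 1
    return beta, [beta * w for w in inv_sq]


# One optimal arm, one Delta-suboptimal arm, K - 2 arms that are 2*Delta-suboptimal.
K = 10
d = Fraction(1, 10)
beta, weights = ucb_weights([Fraction(0), d] + [2 * d] * (K - 2))
```

With these inputs, `1 / beta` equals $(6 + K)/(4\Delta^2)$ and the weights sum to 1, with the optimal arm and the $\Delta$-suboptimal arm receiving the largest weights.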

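For illustration, consider the case when there is one optimal arm, one
$\Delta$-suboptimal arm and $K - 2$ arms that are $2\Delta$-suboptimal. Then

$$\frac{1}{\beta} = \frac{2}{\Delta^2} + \frac{K - 2}{(2\Delta)^2} = \frac{6 + K}{4 \Delta^2} ,$$

and the previous bound of Theorem 5 implies that

$$\mathbb{E} r_n \leq \frac{1}{\alpha - 1} \Bigl( \frac{4n}{6 + K} - 1 \Bigr)^{2(1 - \alpha)} + \frac{K - 2}{\alpha - 1} \Bigl( \frac{n}{6 + K} - 1 \Bigr)^{2(1 - \alpha)} \qquad \text{(A.1)}$$

for all $n$ sufficiently large, e.g.,

$$n \geq \max \Bigl\{ (K + 2)(6 + K), \ (4\alpha + 1) \Bigl( \frac{6 + K}{4 \Delta^2} \Bigr) \ln n \Bigr\} . \qquad \text{(A.2)}$$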

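Now, the upper bound on $\mathbb{E} r_n$ given in Proposition 1 for the uniform allocation
associated with the recommendation provided by the empirical best arm is larger
than

$$\Delta \, e^{-\Delta^2 \lfloor n/K \rfloor} , \qquad \text{for all } n \geq K .$$

Thus for $n$ moderately large, e.g., such that $n \geq K$ and

$$\lfloor n/K \rfloor \leq (4\alpha + 1) \Bigl( \frac{6 + K}{4 \Delta^2} \Bigr) \frac{\ln n}{K} , \qquad \text{(A.3)}$$

the bound for the uniform allocation is at least

$$\Delta \exp \Bigl( -\Delta^2 (4\alpha + 1) \Bigl( \frac{6 + K}{4 \Delta^2} \Bigr) \frac{\ln n}{K} \Bigr) = \Delta \, n^{-(4\alpha + 1)(6 + K)/(4K)} ,$$

which may be much worse than the upper bound (A.1) for the UCB($\alpha$) strategy
whenever $K$ is large, as can be seen by comparing the exponents $-2(\alpha - 1)$
versus $-(4\alpha + 1)(6 + K)/(4K)$.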

The reason is that the uniform allocation strategy only samples each arm
$\lfloor n/K \rfloor$ times, whereas the UCB strategy rapidly focuses its exploration on
the better arms.
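
The comparison of exponents can be tabulated numerically. The short Python sketch below evaluates $-2(\alpha - 1)$ and $-(4\alpha + 1)(6 + K)/(4K)$ for a few values of $K$; the function names and the value $\alpha = 3$ are ours and purely illustrative. Since the second exponent tends to $-(4\alpha + 1)/4$ as $K$ grows, this comparison favours UCB($\alpha$) for large $K$ only when $2(\alpha - 1) > (4\alpha + 1)/4$, that is, $\alpha > 9/4$; this last remark is our own reading of the two formulas.

```python
def ucb_exponent(alpha):
    """Exponent of n in the UCB(alpha) bound: -2 (alpha - 1)."""
    return -2.0 * (alpha - 1.0)


def uniform_exponent(alpha, k):
    """Exponent of n in the estimate for the uniform allocation bound."""
    return -(4.0 * alpha + 1.0) * (6.0 + k) / (4.0 * k)


if __name__ == "__main__":
    alpha = 3.0  # illustrative value, chosen above 9/4
    for k in (2, 10, 100, 1000):
        # A more negative exponent means a faster polynomial decay in n.
        print(k, ucb_exponent(alpha), round(uniform_exponent(alpha, k), 3))
```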